Energy and Transient Power Minimization During Behavioral Synthesis

by

Saraju P. Mohanty

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: N. Ranganathan, Ph.D.
Murali Varanasi, Ph.D.
Srinivas Katkoori, Ph.D.
Wilfredo A. Moreno, Ph.D.
A. N. V. Rao, Ph.D.

Date of Approval: October 17, 2003

Keywords: peak power, average power, power fluctuation, low power synthesis, datapath scheduling, multiple supply voltages, dynamic frequency clocking, multicycling, digital watermarking

© Copyright 2003, Saraju P. Mohanty

DEDICATION

My state Kalinga (Orissa), World's largest democracy (India), World's oldest democracy (USA), my Parents, my Sisters, Uma, and to everyone who has taught me free thinking.

ACKNOWLEDGEMENTS

I would like to express gratitude to my major professor, Dr. N. Ranganathan, for his guidance and support throughout my doctoral degree program. I would sincerely like to thank Dr. K. R. Ramakrishan, Dr. Mohan S. Kanakanhalli, Dr. Chitta Baral, Dr. Rabi N. Mahapatra, Dr. Debasmita Misra, Dr. Srinivas Katkoori and Dr. Sanjukta Bhanja for their support in various phases of my student life. Special thanks to Dr. D. Rundus, Dr. R. Perez, Dr. Goldgof and all the members of my Ph.D. committee. I would also like to thank all members of the VCAPP group (such as Ashok, Sunil, Ravi, Karthik, Suvodeep, Mouli, Bamini, Stelian, Hao, Praveen, etc.) for their help and cooperation. Special thanks to Dr. Austell, the ISSS office at USF, the office staff of the CSE department at USF, and the technical support staff of the CSE department at USF (Daniel).
Last but not the least, I thank all my friends (Uma, Rupesh, Siddy, Ajaya, Lulu, Pati, Prince, Bhabani, Durga, Amaresh, Krishna, Rajib, Sridhar, Saroj, Jai, Hari, etc.), who have always been a constant source of moral support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 Fundamentals of High Level Synthesis
    1.1.1 Why High-Level Synthesis?
    1.1.2 Various Phases of High-Level Synthesis
    1.1.3 A Synthesis Example
  1.2 Sources of Power Dissipation in a CMOS Circuit
  1.3 Methods for Power Reduction in High-Level Synthesis
  1.4 Why Peak Power Minimization?
  1.5 Why Average Power and Energy Reduction?
  1.6 Why Transient Power Minimization?
  1.7 Why Frequency and Voltage Scaling?
  1.8 Multiple Supply Voltages, Dynamic Clocking and Multicycling Preliminaries
    1.8.1 What is Dynamic Frequency Clocking?
    1.8.2 Energy or Power Reduction Due to Voltage or Frequency Scaling
    1.8.3 Issues in Multiple Supply Voltage Based Design
    1.8.4 Level Converter Design
    1.8.5 Dynamic Frequency Clocking Unit Design
  1.9 Fundamentals of Digital Watermarking
    1.9.1 General Framework for Watermarking
    1.9.2 Types of Watermarking
  1.10 Contributions of this Dissertation
  1.11 Dissertation Outline
CHAPTER 2 RELATED WORK
  2.1 Datapath Scheduling for Energy or Average Power Reduction using Voltage Reduction
  2.2 Switching Activity Reduction During High-Level Synthesis
  2.3 Datapath Scheduling for Peak Power Reduction
  2.4 Scheduling for Variable Voltage Processor
  2.5 Design and Synthesis for Low-Power or High-Performance Variable Voltage / Frequency / Latency and Multiple Voltage Based Systems
  2.6 Hardware Based Digital Watermarking Systems
  2.7 This Dissertation
CHAPTER 3 ENERGY MINIMIZATION
  3.1 Target Architecture and Datapath Specifications
  3.2 Time Constrained Scheduling
    3.2.1 Algorithm Flow
    3.2.2 Pseudocode Description
    3.2.3 Time Complexity
  3.3 Resource Constrained Scheduling
    3.3.1 Algorithm Flow
    3.3.2 Pseudocode of the Resource Constrained Algorithm
    3.3.3 Time Complexity
  3.4 Experimental Results
  3.5 Conclusions
CHAPTER 4 ENERGY DELAY PRODUCT MINIMIZATION
  4.1 Energy Delay Product of a Datapath Circuit
  4.2 ILP Formulations
    4.2.1 ILP Formulations: Dynamic Frequency Clocking
    4.2.2 ILP Formulations: Multicycling
  4.3 Datapath Scheduling Algorithm
    4.3.1 Scheduling for MVDFC
    4.3.2 Scheduling for MVMC
  4.4 Experimental Results
  4.5 Conclusions
CHAPTER 5 PEAK POWER AND AVERAGE POWER MINIMIZATION
  5.1 Peak and Average Power Consumption of a Datapath Circuit
  5.2 ILP Formulations
    5.2.1 ILP Formulations for DFC
    5.2.2 ILP Formulations for Multicycling
  5.3 ILP-Based Scheduler
    5.3.1 Scheduler using Multiple Voltages and Dynamic Frequency Clocking
    5.3.2 Scheduler using Multiple Supply Voltages and Multicycling
  5.4 Experimental Results
  5.5 Peak Power Minimization
    5.5.1 ILP Formulations
      5.5.1.1 Multiple Supply Voltages and Dynamic Frequency Clocking (MVDFC)
      5.5.1.2 Multiple Supply Voltages and Multicycling (MVMC)
    5.5.2 ILP-Based Scheduler
      5.5.2.1 Scheduling for MVDFC
      5.5.2.2 Scheduling for MVMC
    5.5.3 Experimental Results
  5.6 Conclusions
CHAPTER 6 ENERGY AND TRANSIENT POWER MINIMIZATION
  6.1 Cycle Power Function (CPF)
    6.1.1 Model 1: CPF using Mean Deviation
    6.1.2 Model 2: CPF using Cycle-to-Cycle Gradient
  6.2 CPF-Scheduler Algorithm
  6.3 Experimental Results
  6.4 Conclusions
CHAPTER 7 TRANSIENT POWER MINIMIZATION
  7.1 Modified Cycle Power Function
  7.2 Modeling of Nonlinearities
    7.2.1 LP Formulation Involving Sum of Absolute Deviations
    7.2.2 LP Formulation Involving Fraction
  7.3 ILP Formulations to Minimize Cycle Power Function
    7.3.1 Multiple Voltages and Dynamic Frequency Clocking (MVDFC)
    7.3.2 Multiple Voltages and Multicycling (MVMC)
  7.4 ILP-Based Scheduling Algorithm
    7.4.1 CPF-MVDFC Scheduling Scheme
    7.4.2 CPF-MVMC Scheduling Scheme
  7.5 Experimental Results
  7.6 Conclusions
CHAPTER 8 POWER FLUCTUATION MINIMIZATION
  8.1 Power Fluctuation Modeling
  8.2 Modeling of Nonlinearities
  8.3 ILP Formulations to Minimize Mean Power Gradient
    8.3.1 Formulations using Multiple Voltages and Dynamic Frequency
    8.3.2 Formulations using Multiple Supply Voltages and Multicycling
  8.4 Scheduling Algorithm
  8.5 Experimental Results
  8.6 Conclusions
CHAPTER 9 VLSI DESIGN FOR DIGITAL WATERMARKING OF IMAGES
  9.1 Invisible Watermarking in Spatial Domain
    9.1.1 Spatial Domain Invisible Watermarking Algorithms
      9.1.1.1 Invisible Robust Algorithm
      9.1.1.2 Invisible Fragile Algorithm
    9.1.2 VLSI Architecture for Invisible Spatial Domain Watermarking
      9.1.2.1 Architecture for Robust Watermarking
      9.1.2.2 Architecture for Fragile Watermarking
      9.1.2.3 Overall Chip Architecture
    9.1.3 Implementation of Spatial Domain Invisible Watermarking Chip
    9.1.4 Results and Conclusions
  9.2 Visible Watermarking in Spatial Domain
    9.2.1 Watermarking Algorithms
      9.2.1.1 Visible Watermarking Algorithm 1
      9.2.1.2 Visible Watermarking Algorithm 2
    9.2.2 VLSI Architecture
      9.2.2.1 Architecture for Algorithm 1
      9.2.2.2 Architecture for Algorithm 2
      9.2.2.3 Architecture for the Watermarking Processor
    9.2.3 Chip Implementation
    9.2.4 Results and Conclusions
  9.3 Invisible and Visible Watermarking in DCT Domain
    9.3.1 Watermarking Algorithms
      9.3.1.1 Spread Spectrum Invisible Watermarking Insertion Algorithm
      9.3.1.2 Visible Watermarking Insertion Algorithm
      9.3.1.3 Algorithm Modification for Hardware Implementations
    9.3.2 VLSI Architecture
CHAPTER 10 CONCLUSIONS AND FUTURE WORK
REFERENCES
ABOUT THE AUTHOR

LIST OF TABLES

Table 2.1 Datapath Scheduling Schemes using Multiple Supply Voltages
Table 2.2 High-Level Synthesis Schemes using Switching Activity Reduction
Table 2.3 Relative Performance of Various Schemes Proposed for Peak Power Minimization
Table 2.4 Scheduling Algorithms for Variable Voltage Processor
Table 2.5 Design and Synthesis Works on Variable Frequency or Multiple Frequency
Table 2.6 Watermarking Chips Proposed in Current Literature
Table 3.1 List of Functions used in the TC-DFC Algorithm
Table 3.2 List of Variables and Data Structures used in the TC-DFC Algorithm Description
Table 3.3 TC-DFC Frequency Selection: from Left to Right
Table 3.4 Vertex Priority List
Table 3.5 Cycle Priority List
Table 3.6 Cycle Priority List
Table 3.7 Frequency Selection (From Left to Right in Each Step)
Table 3.8 Resource Lookup Table (Order From Left to Right)
Table 3.9 List of Functions used in the RC-DFC Algorithm
Table 3.10 List of Variables and Data Structures used in the RC-DFC Algorithm Description
Table 3.11 Resource Constraints used in our Experiments
Table 3.12 Energy Details for Different Benchmarks using RC-DFC Scheduler
Table 3.13 Configurations for Minimum EDP using RC-DFC
Table 3.14 Energy Savings using TC-DFC Scheduler
Table 3.15 Savings for Various Resource Constrained Schedulings
Table 3.16 Savings for Various Time Constrained Schedulings
Table 4.1 Notations used in Description
Table 4.2 Notations used in ILP Formulations
Table 4.3 Energy and EDP Estimates for Benchmarks for MVDFC and MVMC Schemes
Table 4.4 Savings for Various Scheduling Schemes
Table 5.1 Notations used in Description
Table 5.2 Notations used in ILP Formulations
Table 5.3 Notations used in Expressing Results
Table 5.4 Resource Constraints used for our Experiment
Table 5.5 Peak Power, Average Power and PDP Estimates for Benchmarks using Scheduling Schemes
Table 5.6 Peak and Average Power Reduction for Various Scheduling Schemes
Table 5.7 Resource Constraints used for our Experiment
Table 5.8 Power Estimates for MVDFC and MVMC Scheduling Schemes
Table 5.9 Power Reduction for Various Scheduling Schemes
Table 6.1 List of Notations and Terminology used in CPF Modeling
Table 6.2 Notations used to Express the Results
Table 6.3 Power Estimates for Different Benchmarks (using Model 1)
Table 6.4 Power Estimates for Different Benchmarks (using Model 2)
Table 7.1 List of Variables used in ILP Formulations
Table 7.2 List of Variables used to Express the Results
Table 7.3 Power, Energy and EDP Estimates for Benchmarks using MVDFC
Table 7.4 Power, Energy and EDP Estimates for Benchmarks using MVMC
Table 8.1 Notations used in the Description
Table 8.2 Notations used in ILP Formulations
Table 8.3 Notations used in Describing the Results
Table 8.4 Power Estimates for Benchmarks
Table 9.1 Notations used to Explain Spatial Domain Watermarking Algorithms
Table 9.2 Control Signals for Spatial Domain Invisible Watermarking Chip
Table 9.3 Power, Area Details for Individual Units
Table 9.4 Overall Chip Statistics
Table 9.5 List of Variables used in Algorithm Explanation
Table 9.6 Power and Area of Different Units
Table 9.7 Overall Statistics of the Watermarking Chip
Table 9.8 Notations used in the Description of the Algorithm
Table 9.9 Overall Statistics of the DCT Domain Watermarking Chip [85]

LIST OF FIGURES

Figure 1.1 Chronological Change in Power, Power Density, Transistor Count, Gate Count, Operating Frequency and Feature Size of CMOS Integrated Circuits
Figure 1.2 Description of Hardware in Different Domains and Abstractions [4]
Figure 1.3 Synthesis Flow
Figure 1.4 Various Phases of High-Level Synthesis
Figure 1.5 Data Flow Graph and Control Flow Graph of a Square Root Algorithm [3]
Figure 1.6 Different Types of Scheduling Algorithms
Figure 1.7 A Synthesis Example: Step 1 to Step 3
Figure 1.8 The Synthesis Example: Step 4 to Step 6
Figure 1.9 Sources of Power Dissipation in a CMOS Circuit
Figure 1.10 Static Vs Dynamic Power Dissipation for Different Switching Activity [6, 7]
Figure 1.11 Dynamic Frequency Generation using Dynamic Clocking Unit [54]
Figure 1.12 Data Flow Graph in Three Modes of Operation
Figure 1.13 Level Converter Schematic Diagram [65, 66]
Figure 1.14 Level Converter Layout and Simulation
Figure 1.15 Dynamic Clocking Unit: Ranganathan et al. [59]
Figure 1.16 Dynamic Clocking Unit and Output Clock: Brynjolfson and Zilic [61]
Figure 1.17 Visible Watermarked Image [71]
Figure 1.18 General Framework of Digital Watermarking
Figure 1.19 Different Types of Watermarks and Watermarking Techniques
Figure 1.20 Contributions of this Dissertation
Figure 1.21 Energy Vs Peak Power Efficient Schedule
Figure 2.1 Variable Voltage Processor Operation: Voltage Vs Frequency [122]
Figure 3.1 Level Converters Needed for Stepping up Signal
Figure 3.2 HAL Differential Equation Solver (with ASAP labels)
Figure 3.3 TC-DFC Scheduling Algorithm Flow
Figure 3.4 Pseudocode for TC-DFC Scheduling Algorithm
Figure 3.5 Schedules Obtained for HAL Benchmark for Different Time Constraints using TC-DFC
Figure 3.6 RC-DFC Scheduling Algorithm Flow
Figure 3.7 Pseudocode for RC-DFC Scheduler
Figure 3.8 Final Schedule of FIR Filter DFG (using RC-DFC)
Figure 3.9 Average Energy and EDP Reduction for Benchmarks
Figure 4.1 ILP Based Scheduling for Low EDP
Figure 4.2 Example Data Flow Graph for Multiple Supply Voltages and Dynamic Frequency Clocking
Figure 4.3 ILP Formulation for Example DFG for Multiple Supply Voltages and Dynamic Frequency Clocking
Figure 4.4 Example DFG (for RC2) (MVMC)
Figure 4.5 ILP Formulation for Example DFG for Multiple Supply Voltages and Multicycling
Figure 4.6 Reduction for Different Benchmarks Expressed as Percentage in Average
Figure 5.1 ILP-Based Scheduler
Figure 5.2 Example DFG for Resource Constraint RC3; using Multiple Supply Voltages and Dynamic Frequency Clocking
Figure 5.3 ILP Formulation for Example DFG using DFC, for RC3 and a Given Switching Activity
Figure 5.4 Example DFG for Resource Constraint RC3; using Multiple Supply Voltages and Multicycling
Figure 5.5 ILP Formulation for Example DFG using Multicycling, for RC3 and a Given Switching Activity
Figure 5.6 Average Reduction for Different Benchmarks
Figure 5.7 Example DFG (for RC1) (MVDFC)
Figure 5.8 ILP Formulation for Example DFG (MVDFC)
Figure 5.9 ILP Formulation for Example DFG (MVDFC) in AMPL
Figure 5.10 Example DFG (for RC1) (MVMC)
Figure 5.11 ILP Formulation for Example DFG (MVMC)
Figure 5.12 ILP Formulation for Example DFG (MVMC) in AMPL
Figure 5.13 Average Reductions for Benchmarks
Figure 6.1 The CPF-Scheduler Algorithm Flow
Figure 6.2 The CPF-Scheduler Algorithm Heuristic
Figure 6.3 Cycle Power Consumptions for Resource Constraint RC1
Figure 6.4 Cycle Power Consumptions for Resource Constraint RC2
Figure 6.5 Cycle Power Consumptions for Resource Constraint RC3
Figure 6.6 Cycle Power Consumptions for Resource Constraint RC4
Figure 6.7 Percentage Average Reduction for Benchmarks using Model 1
Figure 6.8 Percentage Average Reduction for Benchmarks using Model 2
Figure 7.1 Scheduling for CPF Minimization
Figure 7.2 ASAP and ALAP Schedule for Example DFG (used to find Mobility Graph)
Figure 7.3 Mobility Graph and Final Schedule for Example DFG for RC5 using MVDFC
Figure 7.4 Mobility Graph and Final Schedule for Example DFG for RC5 using MVMC
Figure 7.5 Average Reductions in Power or Energy for Benchmarks using CPF-MVDFC
Figure 7.6 Average Reductions for Benchmarks using CPF-MVMC
Figure 7.7 Power Profile for Benchmark for Resource Constraint RC1
Figure 7.8 Power Profile for Benchmark for Resource Constraint RC2
Figure 7.9 Power Profile for Benchmark for Resource Constraint RC3
Figure 7.10 Power Profile for Benchmark for Resource Constraint RC4
Figure 7.11 Power Profile for Benchmark for Resource Constraint RC5
Figure 8.1 Scheduling for MPG Minimization
Figure 8.2 Example Data Flow Graph (DFG)
Figure 8.3 Average Reductions using DFC Scheme
Figure 8.4 Average Reductions using Multicycling Scheme
Figure 8.5 Power Profiles for Benchmarks (for RC2)
Figure 8.6 Power Profiles for Benchmarks (for RC3)
Figure 8.7 Power Profiles for Benchmarks (for RC5)
Figure 9.1 Secure JPEG Encoder: Block Level View [176]
Figure 9.2 Secure Digital Still Camera: Schematic View
Figure 9.3 Invisible Robust Watermarking in Spatial Domain [177, 178]
Figure 9.4 Invisible Fragile Watermarking in Spatial Domain [83, 72]
Figure 9.5 Datapath for Robust Watermarking
Figure 9.6 Datapath for Fragile Watermarking
Figure 9.7 Datapath for Combined Spatial Domain Invisible Robust / Fragile Watermarking
Figure 9.8 Controller for Combined Spatial Domain Invisible Robust / Fragile Watermarking
Figure 9.9 Layout of the Invisible Spatial Domain Watermarking Datapath and Controller
Figure 9.10 Layout of RAM (Zoomed view of a portion is shown)
Figure 9.11 Layout of the Proposed Spatial Domain Invisible Watermarking Chip
Figure 9.12 Pin Diagram for the Proposed Spatial Domain Invisible Watermarking Chip
Figure 9.13 Spatial Domain Invisible Watermarked Shuttle
Figure 9.14 Spatial Domain Invisible Watermarked Bird
Figure 9.15 Datapath Architectures for the Visible Watermarking Algorithms
Figure 9.16 Individual Datapath Units for Algorithm 2
Figure 9.17 Architecture for the Proposed Watermarking Processor
Figure 9.18 Layout of Datapath and Controller of the Proposed Chip
Figure 9.19 Layout and Floor Plan of the Proposed Watermarking Chip
Figure 9.20 Pin Diagram for the Proposed Watermarking Chip
Figure 9.21 Original Host Images (a, b, and c) and Watermark Image (d)
Figure 9.22 Watermarked Images for the First Algorithm
Figure 9.23 Watermarked Images for the Second Algorithm
Figure 9.24 Combined Architecture for DCT Domain Invisible and Visible Watermarking Chip
Figure 9.25 Architecture of the Different Units used for Invisible Watermarking
Figure 9.26 Architecture of the Different Units used for Visible Watermarking
Figure 9.27 Dual Voltage and Dual Frequency Operation of the Datapath
Figure 9.28 Layout of the DCT Domain Invisible and Visible Watermarking Chip [85]
Figure 9.29 Floorplan of the DCT Domain Invisible and Visible Watermarking Chip [85]

ENERGY AND TRANSIENT POWER MINIMIZATION DURING BEHAVIORAL SYNTHESIS

Saraju P. Mohanty

ABSTRACT

The proliferation of portable systems and mobile computing platforms has increased the need for the design of low power consuming integrated circuits. The increase in chip density and clock frequencies due to technology advances has made low power design a critical issue. Low power design is further driven by several other factors such as thermal considerations and environmental concerns. In low-power design for battery driven portable applications, the reduction of peak power, peak power differential, average power and energy are equally important. In this dissertation, we propose a framework for the reduction of these parameters through datapath scheduling at the behavioral level. Several ILP based and heuristic based scheduling schemes are developed for datapath synthesis assuming: (i) single supply voltage and single frequency (SVSF), (ii) multiple supply voltages and dynamic frequency clocking (MVDFC), and (iii) multiple supply voltages and multicycling (MVMC). The scheduling schemes attempt to minimize: (i) energy, (ii) energy delay product, (iii) peak power, (iv) simultaneous peak power and average power, (v) simultaneous peak power, average power, peak power differential and energy, and (vi) power fluctuation.
A new parameter called Cycle Power Function (CPF) is defined which captures the transient power characteristics as the equally weighted sum of normalized mean cycle power and normalized mean cycle differential power. Minimizing this parameter using multiple supply voltages and dynamic frequency clocking results in the reduction of both energy and transient power. The cycle differential power can be modeled as either the absolute deviation from the average power or as the cycle-to-cycle power gradient. The switching activity information is obtained from behavioral simulations. Power fluctuation is modeled as the cycle-to-cycle power gradient, and to reduce fluctuation the mean power gradient (MPG) is minimized. The power models take into consideration the effect of switching activity on the power consumption of the functional units. Experimental results for selected high-level synthesis benchmark circuits under different constraints indicate that significant reductions in power, energy and energy delay product can be obtained and that the MVDFC and MVMC schemes yield better power reduction compared to the SVSF scheme.

Several application specific VLSI circuits were designed and implemented for digital watermarking of images. Digital watermarking is the process that embeds data called a watermark into a multimedia object such that the watermark can be detected or extracted later to make an assertion about the object. A class of VLSI architectures was proposed for various watermarking algorithms: (i) spatial domain invisible-robust watermarking scheme, (ii) spatial domain invisible-fragile watermarking scheme, (iii) spatial domain visible watermarking scheme, (iv) DCT domain invisible-robust watermarking scheme, and (v) DCT domain visible watermarking scheme. Prototype implementations of (i), (ii) and (iii) are given. The hardware modules can be incorporated in a JPEG encoder or in a digital still camera.
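As a rough illustration of the Cycle Power Function described in the abstract, the following Python sketch computes an equally weighted sum of normalized mean cycle power and normalized mean cycle differential power for a per-cycle power profile. It is only a sketch under stated assumptions: each mean is normalized by the corresponding peak value over the schedule (the abstract does not state the normalization), and the function name, the `model` argument and the sample power values are hypothetical.

```python
from statistics import mean


def cycle_power_function(cycle_powers, model="deviation"):
    """Illustrative CPF: normalized mean cycle power plus normalized mean
    cycle differential power, equally weighted.

    Assumption (not stated in the abstract): each mean is normalized by the
    corresponding peak value over the schedule.
    """
    if len(cycle_powers) < 2:
        raise ValueError("need at least two control steps")

    avg_power = mean(cycle_powers)
    peak_power = max(cycle_powers)

    if model == "deviation":
        # Model 1: absolute deviation of each cycle from the average power.
        diffs = [abs(avg_power - p) for p in cycle_powers]
    elif model == "gradient":
        # Model 2: cycle-to-cycle power gradient between adjacent cycles.
        diffs = [abs(b - a) for a, b in zip(cycle_powers, cycle_powers[1:])]
    else:
        raise ValueError("model must be 'deviation' or 'gradient'")

    peak_diff = max(diffs)

    # Guard against division by zero for all-zero or perfectly flat profiles.
    norm_mean_power = avg_power / peak_power if peak_power > 0 else 0.0
    norm_mean_diff = mean(diffs) / peak_diff if peak_diff > 0 else 0.0
    return norm_mean_power + norm_mean_diff


# Hypothetical per-cycle power values (arbitrary units), same average power.
flat_profile = [2.0, 2.0, 2.0, 2.0]
spiky_profile = [1.0, 3.0, 1.0, 3.0]

print(cycle_power_function(flat_profile))
print(cycle_power_function(spiky_profile, model="deviation"))
print(cycle_power_function(spiky_profile, model="gradient"))
```

In this toy comparison both profiles have the same average power, but the flat profile scores lower because its differential term vanishes. The comparison is specific to these two profiles and to the assumed normalization; it is not a general property claimed by the dissertation.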
CHAPTER 1 INTRODUCTION

Low power circuit design is a three dimensional problem involving area, performance and power tradeoffs. Because of the decreasing feature size and increasing packing density, it may be possible to trade area against power [1]. The trend of decreasing device size and increasing chip densities, involving several hundred million transistors per chip, has resulted in a tremendous increase in design complexity. Designing chips of such complexity using the traditional capture and simulate methodology is time consuming and difficult. The industry has started looking at the development cycle to reduce design time and to gain a competitive edge. High-level synthesis of digital circuits has become necessary due to several advantages, such as reduction of design time, exploration of different design styles, and meeting design constraints and requirements [2, 3, 4]. Additionally, this trend of reducing the feature size while increasing the clock frequency has made reliability a big challenge for designers, mainly because of high on-chip electric fields [1, 5, 6, 7, 8]. Fig. 1.1 shows the chronological change in power, power density, transistor count, gate count, operating frequency and feature size of CMOS ICs.

The high-level synthesis process can be defined as the translation from a behavioral description to its structural description [3, 14, 4, 15]. This is analogous to a compiler that translates a high-level language program in C or Pascal to an assembly language program. High-level synthesis is also known as behavioral-level synthesis or algorithm-level synthesis. The constraints to be considered in high-level synthesis are area, performance, power consumption, reliability, testability and cost. With the increasing demand for personal computing devices and wireless communications equipment, the demand for designing low power consuming circuits has increased.
Power has become an important parameter along with area and throughput. The need for low power synthesis is driven by several factors [16, 17, 18, 19, 20]:

• Increased demand for portable systems: The emergence of portable devices like laptop computers, mobile phones, etc., for which battery life is an important factor.
• Thermal considerations: If power dissipation can be reduced, the cost of cooling and packaging would be reduced.
• Environmental concerns: The smaller the power dissipation in a circuit, the less heat is pumped into the rooms. So, the electricity consumption will be lower and the impact on the environment will be less.
• Reliability issues: If the power consumption is higher, the temperature in the circuit increases. This may lead to phenomena like electromigration and hot-electron effects, which reduce the reliability of the system. In fact, every 10°C rise in operating temperature roughly doubles the failure rate of the components.

Figure 1.1. Chronological Change in Power, Power Density, Transistor Count, Gate Count, Operating Frequency and Feature Size of CMOS Integrated Circuits: (a) Increase in Power [8, 9, 10]; (b) Increase in Power Density [9, 11, 10]; (c) Increase in Transistor Count [11, 10]; (d) Increase in Gate Count [12]; (e) Increase in Frequency [11, 10]; (f) Decrease in Feature Size [11, 10, 13].

The growth of high speed computer networks, and that of the internet in particular, has opened up new business, scientific, entertainment, and social opportunities. Ironically, the cause of this growth is also a cause of apprehension about the use of digitally formatted data. Digital media offer several distinct advantages over analog media, such as high quality, easy editing, and high fidelity copying. The ease with which digital information can be duplicated and distributed has led to the need for effective copyright protection tools.
Various software products have been recently introduced in an attempt to address these growing concerns. This is done by hiding metadata (information) within digital audio, image and video files. One way of such data hiding is a digital signature, copyright label or digital watermark that completely characterizes the person who applies it and, therefore, marks it as being his intellectual property. Digital watermarking is the process that embeds data called a watermark into a multimedia object such that the watermark can be detected or extracted later to make an assertion about the object. While software implementations of digital watermarking techniques are numerous, hardware implementations are very few. The hardware implementation has advantages over the software implementation in terms of low power, high performance and reliability. Also, the hardware implementation of watermarking techniques is essential for real-time watermarking applications, such as digital TV broadcasting.

This chapter presents a general overview of high-level synthesis and power minimization in VLSI circuits. The chapter is organized as follows. Section 1.1 discusses high-level synthesis in general and the motivation behind it. The various sources of power consumption are discussed in Section 1.2. The possible methods of power reduction are described in Section 1.3. Section 1.4 discusses why we need to minimize peak power. The need for average power and energy reduction is discussed in Section 1.5, and that for transient power reduction in Section 1.6. Section 1.7 discusses how frequency and voltage scaling can reduce energy or power in a circuit. The design issues for multiple supply voltage and dynamic frequency clocking based circuits are discussed in Section 1.8. The fundamentals of digital watermarking are discussed in Section 1.9. Section 1.10 discusses the contributions of this dissertation.
The dissertation outline is gi v en in Section 1.11. 1.1 Fundamentals of High Le v el Synthesis In circuit analysis, we study the beha vior or characterisitcs of a circuit. Synthesis process is the re v erse of analysis process. The task of synthesis process is to tak e the specications of the beha vior required for a system and a set of constraints and goals to be satised, and to nd a structure that implements the beha vior while satisfying the goals and constraints [3 4 15 21 ]. The beha vior of the system refers to the w ays in which the system or its components interact with their en vironment (mapping from inputs to outputs). The structure refers to the set of interconnected components that constitute the system (described by a netlist). Finally the structure must be mapped into a physical design. Beha vior structure and physical design are considered as three domains in which a hardw are can be described (Fig. 1.2(a) and 1.2(b)). In beha vioral domain, we are interested in what a design does, not in ho w it is b uilt. The physical domain ingnores what the design is supposed to do and binds its structure in space or to silicon. A structual representation bridges the beha vioral and physical representation. It is onetoone mapping of a beha vioral representation onto a set of components and connections under constraints, such as area, cost and delay Fig. 1.2(a) describes the design automation terminologies, such as optimization, synthesis, analysis, and optimization in the hardw are representation domain. The ax es in Y chart (Fig. 
1.2(b)) 4 PAGE 22 Physical / Geometrical Domain Structural Domain Behavioral Domain Abstraction Analysis Synthesis Generation Extraction Optimization Refinement (a) Y chart : Anaylsis, Optimization or Synthesis Physical / Geometrical Domain Structural Domain Behavioral Domain Circuit Synthesis RT Synthesis Logic Synthesis System Synthesis Transistor Function Algorithms Register Transfer Boolean Expressions Transistor Layouts Cells Chips Boards, MCMs Processors, Memories, Buses Registers, ALUs, MUXs Gates, FlipFlops Transistors (b) Y chart : Detailed Hardw are Description Figure 1.2. Desription of Hardw are in Dif ferent Domains and Abstractions [4 ] 5 PAGE 23 (Tranformation, Scheduling, Module Selection) (TwoLevel, MultiLevel Synthesis) Allocation or Partitioning) (Hardware / Software (Placement, Routing, Clock Distribution) System Specifications Behavioral Description RTL Description Gate Level Description Layout Level Description High Level Synthesis System Level Design Logic Synthesis Layout Synthesis Figure 1.3. Synthesis Flo w represent three dif ferent domains of description, such as behv aioral, structural and physical. Each concentric circle intersects the ax es at a particular le v el of representation within a domain. It may be noted that the synthesis process is a transformation from the beha vioral domain to the structual domain, which is represented as an arc in Fig. 1.2(a). The digital circuits are designed and synthesised at se v eral le v els of abstraction as sho wn in Fig. 1.3.3System Le v el: The system le v el is concerned with the o v erall system structure and information o w Computer systems are described as interconnected set of processors, memories and switches in this le v el. 6 PAGE 24 3Beha vioral Le v el: This le v el is also called as Instruction Set Le v el or Algorithmic Le v el. 
At this le v el the focus is on the computations performed by an indi vidual processor the w ay it maps sequences of inputs to sequences of ouputs.3Re gister T ransfer Le v el: The system is vie wed as a set of interconnected storage elements and functional blocks in this le v el. The beha vior of system is described as a series of data transfers and transformations between the storage elements.3Logic Le v el: Belo w the re gister transfer le v el is the logic le v el. The system is described as a netw ork of gates and ipops and the beha vior is specied by logic equations at this le v el.3Layout Le v el: In this le v el, the system is specied in terms of the indi vidual transistors of which it is composed. The beha vior of the system can be described in terms of the netw ork equations. 1.1.1 Wh y HighLe v el Synthesis ? Highle v el synthesis is popular for the follo wing reasons [3 ]:3Shorter design c ycle: If more of the design process is automated, f aster products can be made a v ailable at cheaper prices.3Fe wer errors: Since the synthesis process can be v eried easily the chances of getting errors will be less.3Ability to search the design space: As synthesis system can produce se v eral designs in a small time, the designer has more e xibity to choose proper design considering dif ferent tradeof fs.3Documenting the design process: An automated system can k eep track of design decisions and ef fect of those decisions.3A v ailability of IC technology to more people: As design e xpertise is mo v ed into synthesis system, it becomes easier for a none xpert to produce a chip that meets a gi v en set of specications. 7 PAGE 25 1.1.2 V arious Phases of HighLe v el Synthesis The v arious phases of highle v el synthesis include, compilation, transformation, scheduling, allocation, binding as detailed in Fig. 1.4. HDL Compilation Transformation Scheduling Allocation / Binding Output Generation RTL Description Data Flow Graph Figure 1.4. 
Various Phases of High-Level Synthesis

The behavior of a system to be synthesized is usually specified at the algorithmic level using a high-level programming language like Pascal or C, or a hardware description language such as VHDL or Verilog [3, 22]. The behavior of the system is then compiled into internal representations, which are usually data flow graphs (DFGs) and control flow graphs (CFGs). Each behavioral specification is transformed into a unique graphical representation. The data flow graph is a directed graph that represents the data moves, while the control flow graph is a directed graph that indicates the sequence of operations. The formal definitions of the data flow graph and the control flow graph are given below [3].

A data flow graph (DFG) is a directed graph, $G = (V, E)$, where: (i) $V = \{v_1, v_2, \ldots, v_n\}$ is a finite set whose elements are nodes, and (ii) $E \subseteq V \times V$ is an asymmetric data flow relation, whose elements are directed data edges.

A control flow graph (CFG) is a directed graph, $G = (V, E)$, where: (i) $V = \{v_1, v_2, \ldots, v_n\}$ is a finite set whose elements are nodes, and (ii) $E \subseteq V \times V$ is a control flow relation, whose elements are directed sequence edges.

Let us consider the following algorithm that computes the square root of $X$ using Newton's method [3].

Algorithm: Square Root Calculation
    Y := 0.22 + 0.89 * X;
    I := 0;
    Do until I > 3 loop
        Y := 0.5 * (Y + X / Y);
        I := I + 1;
    End do

The above algorithm can be represented using the following data flow graph and control flow graph (Fig. 1.5).

(a) Data Flow Graph (DFG)
(b) Control Flow Graph (CFG)
Figure 1.5. Data Flow Graph and Control Flow Graph of a Square Root Algorithm [3]

In the transformation step, the initial data flow graph is transformed so that the resultant data flow graph is more suitable for scheduling and allocation. These transformations include compiler-like optimizations such as dead code elimination, common subexpression elimination, loop unrolling, constant propagation and code motion. In addition, some hardware-specific transformations, such as syntactic variance minimization and retiming, may be applied to take advantage of the associativity and commutativity of certain operations.

Scheduling is the process of partitioning the set of arithmetic and logical operations in the data flow graph into groups so that the operations in the same group can be executed concurrently, while taking into consideration possible tradeoffs between the total execution cost and the hardware cost. A group of concurrent computations to be executed simultaneously is referred to as a control step. The total number of control steps needed to execute all operations in the data flow graph, the minimum number of functional units of each type to be used in the design, and the lifetimes of the variables generated during the computation of operations are determined in the scheduling step. Datapath scheduling algorithms may be of various types based on the constraints and optimization schemes, as shown in Fig. 1.6. Various scheduling algorithms are described in [4, 21, 22, 3, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 2, 34, 35, 36, 37, 38]. The commonly used scheduling techniques include integer linear programming, as-soon-as-possible, as-late-as-possible, list-based scheduling, force-directed scheduling and freedom-based scheduling.

Figure 1.6.
Different Types of Scheduling Algorithms

Allocation is the process of determining the functional units of each type for performing operations, the memory units (registers) for storing data values, and the interconnects for data transportation. Binding is the process of assigning variables to memory units, and data transfers to interconnections. Allocation / binding is further divided into tasks such as functional unit allocation / binding, memory unit allocation / binding and interconnect allocation / binding. Functional unit allocation / binding involves mapping the operations in the behavioral description onto a set of selected functional units. Memory unit allocation / binding maps data carriers (constants, variables, arrays) in the behavioral description onto storage elements (ROMs, registers, memory units) in the datapath. The interconnect allocation / binding task maps every data transfer in the behavior onto a set of interconnection units for data routing.

In the output generation phase, the design output is generated. The output should be in a form such that logic-level synthesis tools can optimize the combinational logic and layout synthesis tools can design the chip geometry. The generated output is generally in a low-level hardware description language, such as structural VHDL or EDIF [22].

1.1.3 A Synthesis Example

Let us consider a small synthesis example to examine the various phases of synthesis in detail. Suppose we want to synthesize hardware to perform the operation Z <= (X + Y) * (E - F). Figs. 1.7 and 1.8 illustrate the steps.

1.2 Sources of Power Dissipation in a CMOS Circuit

The sources of power dissipation are shown in Fig. 1.9.
Power dissipation in a CMOS circuit is caused by four sources [17]:
• Leakage current: It is determined by the fabrication process technology and consists of two components: (1) the reverse bias current in the parasitic diodes formed between the source and drain diffusions and the bulk region of the transistor, and (2) the subthreshold current that arises from the inversion charge that exists at gate voltages below the threshold voltage.
• Standby current: It is the DC current drawn continuously from $V_{dd}$ to ground.
• Short-circuit current: This is the current due to the DC path between the supply and ground during output transitions.
• Capacitance current: This current flows to charge and discharge capacitive loads during logic changes.

(a) Step 1: Compilation and Transformation
(b) Step 2: Scheduling (Time or Resource Constraints)
(c) Step 3: Allocation (Fixes Amount and Types of Resources)
Figure 1.7. A Synthesis Example: Step 1 to Step 3

(a) Step 4: Binding (which Resource will be used by which Operation)
(b) Step 5: Connection Allocation (Communication between Resources: Bus, Buffer or MUX)
(c) Step 6: Architecture Generation (Datapath and Control)
Figure 1.8.
The Synthesis Example: Step 4 to Step 6

Figure 1.9. Sources of Power Dissipation in a CMOS Circuit (static: diode leakage, subthreshold current leakage, standby; dynamic: short circuit, capacitive switching)

Capacitive switching power dissipation is caused by the charging and discharging of parasitic capacitance in the circuit and is given by Eqn. 1.1, $P_{dynamic}$ [...]

[...] can be made small with the proper choice of device technology. Standby power dissipation happens when both the nMOS and pMOS transistors are continuously on in a pseudo-nMOS inverter, when the drain of an nMOS transistor is driving the gate of another nMOS transistor in pass-transistor logic, or when the tristated input of a CMOS gate leaks away to a value between $V_{dd}$ and ground. The static-circuit power dissipation is the sum of the leakage and standby power dissipations. The total static power of a CMOS circuit is obtained using Eqn. 1.3 as given below (assuming $n$ transistors). In practice, standby power is neglected compared to the leakage power, and static power is assumed to be the leakage power.

$$P_{static} = \sum_{i=1}^{n} \text{leakage current} \times \text{supply voltage} = \sum_{i=1}^{n} \left( I_{diode} + I_{subthreshold} \right) \times \text{supply voltage} \qquad (1.3)$$

1.3 Methods for Power Reduction in High-Level Synthesis

Leakage power dissipation is small in comparison to other components. In a well-designed circuit, short-circuit power dissipation is less than 10% of dynamic power [39]. It is also evident from Fig. 1.10 [6, 7] that at larger switching activity the static power is negligible compared to the dynamic power dissipation. This shows that dynamic power dissipation is the main component that needs to be addressed. From the dynamic power dissipation expression given in Eqn.
1.1, we can conclude that the parameters that can be varied to affect power as well as energy consumption are:
• the supply voltage,
• the clock frequency,
• the switching activity per clock cycle at various signals in the circuit, and
• the parasitic capacitance.
It is important to note that these parameters are not independent. It is necessary to take into account the interactions and tradeoffs among these parameters to minimize power consumption [17]. The key principles used for low-power design are as follows [20, 40]:
• using the lowest possible supply voltage,
• using the smallest geometry, highest frequency devices, but operating them at the lowest possible frequency,
• using parallelism and pipelining to lower the required frequency of operation,
• power management by disconnecting the power source when the system is idle, and
• designing systems to have the lowest requirements on subsystem performance for the given level of functionality.

Figure 1.10. Static Vs Dynamic Power Dissipation for Different Switching Activity [6, 7]

Based on the above observations, the following are some of the techniques used to reduce power consumption in high-level synthesis [41, 22, 1, 9, 42, 40].
• Transformation: The basic approach is to scan the design space by utilizing various flow graph transformations together with high-level power estimation techniques, and to transform data flow graphs into less power consuming data flow graphs.
• Operator shutdown: The massive switching in large components, such as adders, multipliers and registers, consumes a large amount of power. By disabling the clock signal, the internal nodes remain at static voltage levels and do not consume power.
• Lower supply voltages: In a CMOS circuit, power consumption decreases quadratically with voltage while the speed reduction is linear. When intensive computation is not needed, the supply voltage can be lowered to save power.
• Mixed voltage circuit: Dual voltages on one IC are attractive enough for commercial consideration. Although such an approach is viable, designers must carefully consider crosstalk and latchup issues, among others.
• Increased parallelism: Slower operations can be used on non-time-critical paths, while parallelism can be increased to compensate for slower components. The parallel option consumes less power and has a shorter total delay. However, extra area might be needed to achieve the parallelism.

1.4 Why Peak Power Minimization?

With the increase in chip densities and clock frequencies, the demand for low power integrated circuits has increased. The literature is rich in efforts to reduce the total energy consumption and the average power consumption of CMOS circuits. At the same time, the reduction of peak power consumption is essential for the following reasons [43, 5, 8, 44, 45, 46]:
• to maintain supply voltage levels,
• to increase reliability, and
• to allow smaller heat sinks and cheaper packaging.
The peak power is the maximum power consumption of the integrated circuit (IC) at any instant during its execution. If the current flow is large, then the $IR$ drop of the interconnects becomes large, which can reduce the supply voltage levels at different parts of an IC.
High current flow can reduce reliability because of hot electron effects and high current density. The hot electron effects may lead to runaway current failures and electrostatic discharge failures. High current density can cause electromigration failure. It is observed that the mean time to failure (MTF) of a CMOS circuit is inversely proportional to the current density (or power density). If the current (power) dissipation is large, then the heat generated by the system is large. This, in turn, requires a bigger heat sink and a costlier heat dissipation mechanism in order to keep the operating temperature of the IC within its tolerance limit.

1.5 Why Average Power and Energy Reduction?

Energy and average power reduction is essential for the following reasons [17, 8, 5, 46]:
• to increase battery lifetime,
• to enhance noise margin,
• to reduce cooling and energy costs,
• to reduce use of natural resources, and
• to increase system reliability.
The battery lifetime is determined by the Ah (ampere hour) rating of the battery. If the average power (and/or energy) consumption is high, then the battery lifetime may be reduced because of high ampere consumption. This factor is important for portable applications. The reduction of average power is essential to enhance noise margin (to decrease functional failure). The cost of packaging and cooling is determined by the average current flow, and hence by the average power and energy. The high energy consumption of computer systems leads to environmental concerns due to the need for more power generation. If the average power is large, the operating temperature of the chip increases, which may lead to failures. It is estimated that for each 10°C increase in the operating temperature, the failure rate of the components roughly doubles.

1.6 Why Transient Power Minimization?
Both the peak power and the peak power differential describe the transient power characteristics of a CMOS circuit. In Section 1.4 we discussed the need for peak power reduction. The peak power differential needs to be reduced for the following reasons [8, 5, 47, 48]:
• to reduce power supply noise,
• to reduce crosstalk and electromagnetic noise,
• to increase battery efficiency, and
• to increase reliability.
Power fluctuation leads to a larger $di/dt$, causing power supply noise (similar to the $IR$ drop) because of the self inductance of the power supply lines. Crosstalk is the noise voltage induced in a signal line due to switching in another signal line [5]. The voltage induced by the mutual inductance is expressed as $L \, di/dt$ and that induced by the mutual capacitance as $C \, dv/dt$. If the power fluctuation is high, then large $di/dt$ and $dv/dt$ can introduce significant noise in the signal lines. As the power fluctuation increases, the electrochemical conversion is reduced and hence battery life decreases [49]. High current peaks (power fluctuation) in short time spans can cause high heat dissipation in a localized area of the silicon die, which may lead to permanent failure of the integrated circuit.

1.7 Why Frequency and Voltage Scaling?

With the increasing demand for portable electronic devices, power reduction has emerged as a major design goal in VLSI circuits. Let us consider the following equations for a CMOS circuit [50, 51, 52, 53, 54, 55, 56]:
• Energy dissipation per operation is $E$ [...]
• Power dissipation for the operation is $P_{dynamic} \propto C V^{2} f$ [...]

[...] voltage scheme, different modules or functional units are operated at different supply voltages. Similarly, the variable voltage scheme is a technique in which the operating voltage is varied from time to time. This chapter discusses how energy and power reduction can be achieved through the use of dynamic frequency clocking, voltage scaling and multicycling.
Further, the design related issues of having multiple supply voltages in a processor are discussed. Designs of level converters and dynamic frequency clocking units are also presented.

1.8.1 What is Dynamic Frequency Clocking?

In dynamic frequency clocking, the functional units can be operated at different frequencies depending on the computations occurring within the datapath during a given clock cycle. The strategy is to schedule high energy units, such as multipliers, at lower frequencies so that they can be operated at lower voltages to reduce energy consumption, and low energy units, such as adders, at higher frequencies to compensate for speed. In this clocking scheme, all the units are clocked by a single clock line whose frequency switches at runtime. A clocking mechanism that varies the clock frequency dynamically has been shown to improve the execution time as compared to using a uni-frequency global clock [59]. The generation of such clocks has been studied extensively in [60, 61, 62, 63]. Fig. 1.11(a) shows the uni-frequency and dynamic frequency diagrams. The dynamic clocking unit (DCU), which generates the required clock frequency, uses a clock divider strategy to generate frequencies that are submultiples of the base frequency. The base frequency $f_{base}$ is the maximum frequency (or a multiple of the maximum) of any functional unit (FU) at the maximum supply voltage. A value $cfi_c$ (the cycle frequency index for control step $c$) is loaded as an input to the DCU from the controller. The scheme for dynamic frequency generation is shown in Fig. 1.11(b). Loading a value of $cfi_c$ into the counters provides a divided output clock of frequency $f_{base} / cfi_c$.

1.8.2 Energy or Power Reduction Due to Voltage or Frequency Scaling

To understand how multiple supply voltages, variable frequency and multicycling can be helpful in energy or power reduction, let us consider the small data flow graph shown in Fig. 1.12(a).
(a) Single Frequency Vs Dynamic Frequency
(b) Dynamic Frequency Generation
Figure 1.11. Dynamic Frequency Generation using Dynamic Clocking Unit [54]

Let us analyse the power and energy consumption for this data flow graph in three possible modes of datapath operation: (i) single supply voltage and single frequency, (ii) multiple supply voltages and variable or dynamic frequency, and (iii) multiple supply voltages and multicycling [54, 55, 64]. Let $t_a$ and $t_m$ be the delays of the adder and the multiplier, respectively, at the maximum supply voltage $V$. The DFG is scheduled to three control steps.

Single supply voltage and single frequency (SVSF): Each cycle has a clock width determined by the slowest operator delay $t_m$. The total energy consumption is given by $E_S = 2E_m + 2E_a$ and the total delay is $T_S = 3t_m$. In this case, the peak power consumption is given by $P_{peak_S}$ [...].

Multiple supply voltages and dynamic frequency (MVDFC): Let $E'_m$ and $E'_a$ be energy values less than $E_m$ and $E_a$, respectively, and let $t'_m$ be the delay of the multiplier at the lower voltage $V'$. In the data flow graph shown in Fig. 1.12(a), assume that the clock cycle width for the 3rd cycle is $t_a$, which is smaller than $t_m$. This allows us to increase the clock width of some other cycle from $t_m$ to some $t'_m$ without violating the time constraints (or without time penalty). In this case, the total
In this case, the total 23 PAGE 41 t m t m t m V E m E m E a E a t m + t a t m V V V V V V E + V + m E m E a E a Cycle1 Cycle2 Cycle3 Single Frequency Dynamic Frequency (a) Data Flo w Graph : V ariable Frequenc y Vs Single Frequenc y t m t m t m + + t E m E E V V V V E a V m m/2 a E m/2 Cycle1 Cycle2 Cycle3 Cycle4 Multicycling (b) Data Flo w Graph : Multic yclingPerformance De gradation Single Voltage and Single Frequency Multiple Supply Voltages and Multicycling + + + + (c) Data Flo w Graph : Multic yclingNo Performance De gradation Figure 1.12. Data Flo w Graph in Three Modes of Operation 24 PAGE 42 delayf l OszlOsxkand the ener gy consumption is gi v en by;f ;lO;PkiO;V l O;V k. Since,gf W w and;Pf n ; w , ener gy reduction is achie v ed without de grading performance. Ener gy o v erhead of le v el con v erters ha v e to be considered for this case. The peak po wer consumption is gi v en by ,% r kq f §  . Multiple supply volta g es and multicycling (MVMC) : In this mode of operation, the functional units are operated at multiple supply v oltages. The functional units operating at lo w v oltage are made to run in more than one consecuti v e control steps. Let us assume that multiplier tak es tw o control steps, when it is operated at a lo wer supply v oltage. The e xample data o wgraph for the multic ycling case in sho wn in Fig. 1.12(b). In this case, the total ener gy consumption;l ;lO?;V l ON;kand total delayl Dxl. Since,l X w and;l ; w , ener gy reduction is obtained with a de gradation in performance of the circuit. F or the multic ycling case, le v el con v erters are the only o v erheads. The peak po wer consumption of the DFG will be determined by the multiplication operation in control step 1,% r kq l v  . This is based on the observ ation that the po wer consumption of the multipliers are much higher than that of the adders. It may be noted the abo v e mentioned performance de gradation may not al w ays happen. 
Consider, for example, a DFG such as the one shown in Fig. 1.12(c): although the multiplier is scheduled over two control steps, there is no change in the critical path delay. The delay is $3t_m$ for both the SVSF and MVMC cases.

1.8.3 Issues in Multiple Supply Voltage Based Design

A designer needs to take several design issues into consideration when a multiple voltage design is targeted for fabrication. The effects of multiple voltage operation on IC layout and power supply requirements should be considered [65, 66, 67]. Multiple voltage design may affect IC design in the following ways:
• If the multiple supplies are generated off-chip, additional power and ground pins will be required.
• It may be necessary to partition the chip into separate regions, where all modules in a region operate at the same voltage.
• Some kind of isolation will be required between the regions operated at different voltages.
• There may be some limit on the voltage difference that can be tolerated between the regions.
• Protection against latchup may be needed at the logic interfaces between regions of different voltages.
• New design rules for routing may be needed to deal with signals at one voltage passing through a region at another voltage.
• A choice between generating the voltage on-chip or off-chip has to be made depending on the application.
• The clocking scheme needs to be modified.

1.8.4 Level Converter Design

Whenever one resource has to drive an input of another resource operating at a different voltage, a level conversion is needed. Thus, the level converter (or level shifter) is the most essential component for multiple supply voltage designs. This results in overheads in the form of area and power for multiple supply voltage designs as compared to single supply voltage designs.
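The converter overhead can be estimated early by counting the data edges that cross a voltage boundary. The following is a minimal sketch under that assumption: the DFG is given as a list of (producer, consumer) edges with one supply voltage per operation, and the function name, operation names and voltages are all hypothetical.

```python
def level_converters(edges, voltage):
    """Return the data edges whose producer and consumer use different supplies."""
    return [(src, dst) for src, dst in edges if voltage[src] != voltage[dst]]


# Hypothetical DFG: two multipliers feed an adder, which feeds a second adder.
edges = [("mul1", "add1"), ("mul2", "add1"), ("add1", "add2")]
voltage = {"mul1": 3.3, "mul2": 5.0, "add1": 5.0, "add2": 5.0}

needed = level_converters(edges, voltage)
print(len(needed), "level converter(s) needed:", needed)
```

Only the edge from the 3.3 V multiplier into the 5.0 V adder crosses a voltage boundary here, so this assignment would need a single converter.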
Four possible alternatives are used by various researchers, as listed below [65].
• The level converters can be omitted.
• A chain of inverters can be used at successively higher voltages.
• An active or passive pullup can be used.
• A differential cascode voltage switch (DCVS) can be used.
Various level converter designs have been discussed in [66, 68, 69, 67, 65]. We implemented the level converter design proposed in [65, 66] to gain a better understanding. The schematic diagram, the layout and the simulation waveform are given in Figs. 1.13, 1.14(a) and 1.14(b), respectively. The constant output voltage indicates that the level converter can step up or step down the voltage to produce a constant supply voltage.

Figure 1.13. Level Converter Schematic Diagram [65, 66]

(a) Level Converter Layout
(b) Level Converter Simulation Waveform
Figure 1.14. Level Converter Layout and Simulation

1.8.5 Dynamic Frequency Clocking Unit Design

Dynamic frequency scaling is an efficient power reduction method with large potential power savings. In order to exploit dynamic frequency scaling for energy or power reduction, a clock divider is needed to safely change the clock rates. In this section, the designs of two such dynamic frequency clocking units from the existing literature [59, 61] are described.

Ranganathan, Vijaykrishnan and Bhavanishankar [59] introduce the concept of dynamic frequency clocking. The DFC scheme is more suitable for data flow intensive applications (such as DSP and image processing). In the dynamic frequency clocking scheme, frequency switching occurs based on the units being used, and a single clock drives all the units. The dynamic clocking unit (DCU) generates different clock frequencies based on instruction words. The block diagram of the DCU is shown in Fig. 1.15.
The DCU is a series of cascaded clock divider stages whose inputs are controlled by the pass logic blocks. The output of one clock divider is presented at the input of the next stage when the pass logic is enabled. The pass logic block is controlled by a set of signals generated by the enable encoder. Based on the instruction class, the appropriate pass logic blocks are activated by the enable encoder. The master clock is accordingly divided by the clock divider circuit to generate the resultant output clock.

Figure 1.15. Dynamic Clocking Unit: Ranganathan, et al. [59]

Brynjolfson and Zilic [61] propose a dynamic programmable clock divider (DPCD) for use in conjunction with FPGA clock managers. Clock division by ordinary clock dividers can lead to glitches or distortions of the output clock. Distortions of the output clock can result in metastability and latching errors. The DPCD is capable of performing dynamic frequency division without undesired effects at the output. The circuit is shown in Fig. 1.16(a). Division of the input clock is performed by creating a loop of D flip-flops {A-D} driven by the input clock, and feeding the signal back into the loop through an inverter {D} to create the necessary clock inversion. To expand the length of the output clock, the number of D flip-flops in the loop is increased by multiplexer {L}. In order to perform an odd division, flip-flops {E, F} extend the loop by half a period, with an asynchronous clear of flip-flop {A} on the falling edge of the input clock. For the divider output, multiplexer {N} chooses between the original input clock, for a division of one, and the output of {A}. The output generated by the DPCD is shown in Fig. 1.16(b).
To prevent output glitching, D flip-flops {G, H, J, K} latch the new program value on the rising edge of the output from {A}. Combinational logic {Q, R, S} also helps to prevent glitching, and additionally prevents transient patterns from being captured and fed back, which would cause irregular oscillation in the circuit.

(a) Dynamic Clocking Unit
(b) Output Clock Generated
Figure 1.16. Dynamic Clocking Unit and Output Clock: Brynjolfson and Zilic [61]

1.9 Fundamentals of Digital Watermarking

Digital watermarking technology is an emerging field in computer science, cryptography, signal processing and communications. Digital watermarking is intended by its developers as the solution to the need to provide value added protection on top of data encryption and scrambling for content protection. Like other technologies under development, digital watermarking raises a number of essential questions, as follows.
• What is it?
• How can a digital watermark be inserted or detected?
• How robust does it need to be?
• Why and when are digital watermarks necessary?
• What can watermarks achieve or fail to achieve?
• How should digital watermarks be used?
• How might they be abused?
• How can we evaluate the technology?
• How useful are they, that is, what can they do for content protection in addition to or in conjunction with current copyright laws or the legal and judicial means used to resolve copyright grievances?
• What are the business opportunities?
• What roles can digital watermarking play in the content protection infrastructure?
• And many more ...

Figure 1.17.
Visible Watermarked Image [71]

1.9.1 General Framework for Watermarking

Watermarking is the process that embeds data called a watermark (or digital signature, tag or label) into a multimedia object such that the watermark can be detected or extracted later to make an assertion about the object [70]. The object may be an image, audio or video. A simple example of a digital watermark would be a visible seal placed over an image to identify the copyright; one such example is shown in Fig. 1.17. However, the watermark might contain additional information, including the identity of the purchaser of a particular copy of the material. In general, any watermarking scheme (algorithm) consists of three parts [72].
• The watermark.
• The encoder (insertion algorithm).
• The decoder and comparator (verification or extraction or detection algorithm).
Each owner can use a unique watermark for all objects, or an owner can use different watermarks in different objects. The marking algorithm incorporates the watermark into the object. The verification algorithm authenticates the object, determining both the owner and the integrity of the object. A watermark must be detectable or extractable to be useful. Depending on the way the watermark is inserted and on the nature of the watermarking algorithm, the method used can involve very distinct approaches. In some watermarking schemes, a watermark can be extracted in its exact form, a procedure we call watermark extraction. In other cases, we can detect only whether a specific given watermarking signal is present in an image, a procedure we call watermark detection. It should be noted that watermark extraction can prove ownership, whereas watermark detection can only verify ownership. Fig. 1.18(a) illustrates the encoding process.
Let us denote an image by $I$, a signature by $S = \{s_1, s_2, \ldots\}$ and the watermarked image by $I'$. $E$ is an encoder function: it takes an image $I$ and a signature $S$, and it generates a new image, which is called the watermarked image $I'$. Mathematically,

$$E(I, S) = I' \qquad (1.7)$$

It should be noted that the signature $S$ may be dependent on the image $I$. In such cases, the encoding process described by Eqn. 1.7 still holds.

A decoder function $D$ takes an image $J$ ($J$ can be a watermarked or unwatermarked image, and possibly corrupted) whose ownership is to be determined and recovers a signature $S'$ from the image. In this process an additional image $I$ can also be included, which is often the original and unwatermarked version of $J$. This is due to the fact that some encoding schemes may make use of the original images in the watermarking process to provide extra robustness against intentional and unintentional corruption of pixels. The decoding process can be expressed mathematically as

$$D(J, I) = S' \qquad (1.8)$$

(a) Watermarking Encoder
(b) Watermarking Decoder
(c) Watermarking Comparator
Figure 1.18. General Framework of Digital Watermarking

The extracted signature $S'$ will then be compared with the owner signature sequence by a comparator function $C_\delta$, and a binary output decision is generated. It is 1 if there is a match and 0 otherwise, which can be represented as follows.

$$C_\delta(S', S) = \begin{cases} 1, & c \ge \delta \\ 0, & \text{otherwise} \end{cases} \qquad (1.9)$$

where $C$ is the correlator, $x = C_\delta(S', S)$ is the comparator output, $c$ is the correlation of the two signatures and $\delta$ is a certain threshold.
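The encoder, decoder and comparator roles of Eqns. 1.7 to 1.9 can be sketched in a few lines. The additive embedding rule, the embedding strength `alpha`, the normalized correlation and the threshold value are assumptions chosen for illustration; the framework itself does not prescribe them, and the image here is just a flat list of pixel values.

```python
import math
import random


def encode(image, signature, alpha=2.0):
    """E(I, S) -> I': add a scaled signature to the pixel values (assumed rule)."""
    return [p + alpha * s for p, s in zip(image, signature)]


def decode(test_image, original, alpha=2.0):
    """D(J, I) -> S': estimate the signature using the original image."""
    return [(j - i) / alpha for j, i in zip(test_image, original)]


def correlation(a, b):
    """Normalized correlation c of two signatures, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare(extracted, signature, delta=0.8):
    """C_delta(S', S): 1 if the correlation reaches the threshold, else 0."""
    return 1 if correlation(extracted, signature) >= delta else 0


rng = random.Random(0)
image = [rng.randrange(256) for _ in range(64)]
signature = [1.0 if k % 2 == 0 else -1.0 for k in range(64)]      # +1, -1, +1, ...
other = [1.0 if (k // 2) % 2 == 0 else -1.0 for k in range(64)]   # +1, +1, -1, -1, ...

marked = encode(image, signature)
# Mild corruption of the watermarked image.
noisy = [p + rng.uniform(-0.5, 0.5) for p in marked]

extracted = decode(noisy, image)
match = compare(extracted, signature)   # owner's signature
mismatch = compare(extracted, other)    # unrelated signature
print(match, mismatch)
```

Because the decoder uses the original image, this corresponds to a private scheme in the terminology used later; a public scheme would have to estimate the signature from $J$ alone.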
W ithout loss of generality w atermarking scheme can be treated as a threetupple.c;V: PAGE 53 According to Working Noninvertible Invertible Quasiinvertible Nonquasiinvertible Public Private Fragile Robust Text Video Audio Image Dual Visible Invisible Based Destination Based Source Domain Frequency Domain Spatial Application According to According to Watermarking Domain Type of Document Human Percpetion According to (a) T ypes of W atermarking Image(I) Watermarking Image(I') Original Invisible Watermarking Visible Watermarked Visible Dual Watermarked Image(I") (b) Dual W atermarking Figure 1.19. Dif ferent T ypes of W atermarks and W atermarking T echniques 36 PAGE 54 The in visiblefr a gile w atermark is embedded in such a w ay that an y manipulation or modication of the image w ould alter or destro y the w atermark [82 83 84 ]. Dual w atermark is a combination of a visible and an in visible w atermark [83 ]. In this type of w atermark an in visible w atermark is used as a back up for the visible w atermark as clear from the follo wing diagram (Fig. 1.19(b)). An in visible rob ust private w atermarking scheme requires the original or reference image for w atermark detection; whereas the public w atermarks do not. The class of in visible rob ust w ater marking schemes that can be attack ed by creating a counterfeit original (to be discussed in later sections) is called in vertible w atermarking scheme. Using mathematical notations from Section 1.9.1, an in visible rob ust w atermarking scheme.c;V: PAGE 55 Time Constrained Energy Transient Power Peak power Resource Constrained Energy HeuristicBased Minimization ILPBased Minimization Energy Delay Product Power Fluctuation Peak and Average Power Transient Power (Datapath Scheduling) Synthesis Dissertation Spatial Domain Invisible Spatial Domain Visible DCT Domain Visible DCT Domain Invisible (Watermarking Chips) Design Figure 1.20. 
Contributions of this Dissertation

of low power, high performance and reliability. In this dissertation, we develop a hardware system that can insert invisible-robust, invisible-fragile and visible spatial domain watermarks, as well as DCT domain watermarks, into an image. The hardware module can be easily incorporated in a JPEG encoder to develop a secure JPEG encoder. It may be noted that the corresponding watermark extraction module has to be built into a secure JPEG decoder. The secure JPEG codec can be a part of a scanner or a digital camera, so that the digitized images are watermarked right at the origin.

1.10 Contributions of this Dissertation

The contributions of this dissertation fall into two broad categories: scheduling algorithms for low power behavioral synthesis, and the design of application specific integrated circuits for digital watermarking. Fig. 1.20 outlines the contributions of this dissertation in detail.

During low power synthesis at the behavioral level, several low power subtasks, such as scheduling, allocation and binding, are performed. In this dissertation, scheduling schemes are proposed to reduce peak power, average power, peak power differential, power fluctuation and energy at the behavioral level, using integer linear programming (ILP) models and also using heuristic based algorithms. First, different power models are developed to capture the power characteristics of a datapath circuit. Then, datapath scheduling schemes are developed using multiple supply voltages and dynamic frequency clocking (MVDFC), and multiple supply voltages and multicycling (MVMC). Both these schemes are compared with the single voltage and single frequency (SVSF) scheme.

Figure 1.21. Energy vs. Peak Power Efficient Schedule: (a) Energy Efficient Schedule, (b) Peak Power Efficient Schedule
To have a clear understanding of scheduling for energy and peak power minimization, let us refer to the data flow graph (DFG) in Fig. 1.21. The figure shows two different possible schedules of the same DFG using a multiple supply voltage scheme. Since in both cases there are two multipliers operating at 3.3 V and two adders operating at 5.0 V, the energy and average power consumption of both scheduled DFGs is the same. However, the peak power consumption in Fig. 1.21(b) is less than that in Fig. 1.21(a). The approach in this thesis is to generate peak power efficient schedules similar to the one in Fig. 1.21(b).

A class of VLSI architectures is proposed for digital image watermarking, implementing a set of watermarking algorithms. Several CMOS VLSI circuits are designed and implemented as prototypes, which can be incorporated in a JPEG encoder or a digital still camera. The VLSI implementation of the spatial domain watermarking architectures in CMOS technology is given. To our knowledge, this is the first watermarking chip implementing invisible-robust, invisible-fragile and visible watermarks together. Also, to our knowledge, this is the first watermarking chip having spatial visible watermarking capability. In this dissertation, we also propose the architecture for DCT domain invisible and visible watermarking algorithms. The prototype implementation of the DCT domain invisible and visible watermarking architecture in CMOS technology is given in [85].

1.11 Dissertation Outline

The remainder of the dissertation is organized as follows: Chapter 2 describes the related work in the areas of low power high-level synthesis, variable clocking based systems and hardware based watermarking schemes. The fundamental concepts of multiple supply voltages, dynamic frequency clocking and multicycling are introduced in Section 1.8.
This section also describes how energy or power reduction is obtained by the use of dynamic frequency clocking and multiple supply voltages in a VLSI circuit. In Chapter 3, heuristic based resource constrained and time constrained algorithms are developed for energy efficient datapath scheduling. Chapter 4 discusses the datapath scheduling scheme for synthesis of energy efficient, high performance datapaths, achieved through energy delay product (EDP) minimization. In Chapter 5, the simultaneous reduction of both peak and average power is discussed; that chapter also includes a section on peak power minimization. A heuristic based framework is given in Chapter 6 for simultaneous minimization of various power parameters. Chapter 7 elaborates on transient power minimization through datapath scheduling using ILP-based models. In this case, the cycle difference power is modeled as the absolute deviation from the mean cycle power (an estimate of average power). The power fluctuation of a datapath circuit is characterised as the cycle-to-cycle power gradient in Chapter 8. To reduce the power fluctuation of a datapath circuit, ILP-based scheduling schemes are developed that minimize the mean power gradient (MPG). VLSI designs for digital watermarking of images are proposed in Chapter 9. This includes three designs: one for invisible spatial domain watermarking, one for visible spatial domain watermarking, and a DCT domain visible and invisible watermarking chip. Conclusions and future directions of research are discussed in Chapter 10.

CHAPTER 2
RELATED WORK

The energy consumption of a CMOS circuit is dependent on the supply voltage and the effective switching capacitance. Several datapath scheduling algorithms have been proposed in the literature that optimize either one or both of the above parameters for energy reduction. Moreover, variable frequency or multiple frequency operations are also considered as options for power reduction.
In this chapter, the various related works are classified into methods based on voltage reduction and methods based on switching activity reduction. A few research works that use multiple, dynamic or variable frequencies for the synthesis of low power or high performance systems can also be found in the literature. This chapter briefly outlines these works and further discusses hardware designs for digital watermarking.

In this chapter, a brief overview of the existing literature on energy and power reduction in VLSI circuits is presented. Section 2.1 presents existing works on low power datapath scheduling methods for energy or average power reduction using lower supply voltages. The high-level synthesis works that achieve energy or average power minimization by reducing the load capacitance or switching activity in a circuit are presented in Section 2.2. Section 2.3 presents a brief overview of the literature on datapath scheduling methods for peak power and transient power reduction in a circuit. The scheduling schemes for variable voltage processor core based systems are presented in Section 2.4. In the past, frequency scaling or variable latency concepts have been used for the development of either low power or high performance systems; Section 2.5 reviews such research works proposed in the literature. The design works based on multiple supply voltages are also included in Section 2.5. The hardware based watermarking systems are discussed in Section 2.6.

2.1 Datapath Scheduling for Energy or Average Power Reduction using Voltage Reduction

It is known that voltage reduction is one of the most effective methods of power reduction, since the power or energy consumption is quadratically dependent on the supply voltage.
In this section, we review the works proposed in the literature that use multiple supply voltages during datapath scheduling for minimization of energy or average power.

Johnson and Roy [86, 87] present a method called Minimum Energy Schedule with Voltage Selection (MESVS), based on integer linear programming (ILP), to optimize the schedule, supply voltage levels, and allocation of resources. The MESVS algorithm takes the following inputs: a directed acyclic data flow graph, the allowable set of supply voltages, a limit on the number of supply voltages that can be selected, a minimum difference between the voltages that can be selected, average switching activity values for each datapath operation, and nominal propagation delay and average energy dissipation values for each datapath resource. The objective function for MESVS is an estimate of datapath energy dissipation expressed as a function of supply voltages. The outputs of the MESVS algorithm are the following: (i) a datapath schedule (represented by a scheduled data flow graph), (ii) an energy estimate, (iii) selection of an optimal set of supply voltages, (iv) assignment of a supply voltage to each operation and (v) allocation of resources to each supply voltage. Since the different resources need to operate at different voltages, level conversion is needed. There are four possible schemes: omitting the level converter, using a chain of inverters, using an active or passive pull-up, and using a differential cascode voltage switch (DCVS) circuit. The authors report energy savings compared to operation at a single supply voltage. The other observation was that the use of two supply voltages can reduce power dissipation substantially, while a third supply voltage gave only a small additional reduction compared to two supply voltages.
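To make the idea of an energy estimate "expressed as a function of supply voltages" concrete, the following minimal sketch scales a nominal per-operation energy quadratically with the assigned supply voltage. It is not the MESVS formulation: the reference voltage, the nominal energy values, the operation names and the omission of level-converter energy and of the delay increase at lower voltage are all simplifying assumptions.

```python
REFERENCE_VOLTAGE = 5.0  # assumed voltage at which nominal energies are characterized


def scaled_energy(nominal_energy, vdd, vref=REFERENCE_VOLTAGE):
    """Scale a nominal energy value by (vdd / vref)^2 (quadratic dependence on supply)."""
    return nominal_energy * (vdd / vref) ** 2


def datapath_energy(operations, nominal, assignment):
    """Sum the voltage-scaled energy of every operation in the datapath.

    operations: operation name -> resource type
    nominal:    resource type -> nominal energy at the reference voltage
    assignment: operation name -> assigned supply voltage
    """
    return sum(
        scaled_energy(nominal[kind], assignment[name])
        for name, kind in operations.items()
    )


# Hypothetical datapath: two multiplications and two additions.
operations = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}
nominal = {"mul": 100.0, "add": 10.0}  # arbitrary energy units, illustrative only

single_supply = datapath_energy(operations, nominal, {op: 5.0 for op in operations})
dual_supply = datapath_energy(
    operations, nominal, {"m1": 3.3, "m2": 3.3, "a1": 5.0, "a2": 5.0}
)
```

In this toy setting, moving only the multipliers to the lower voltage removes roughly half of the estimated energy, because the multipliers dominate the nominal energy budget.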
Johnson and Roy [65] present an algorithm called Multiple Operating Voltage Energy Reduction (MOVER) to minimize datapath energy dissipation. Energy savings are obtained at the cost of an area penalty. MOVER generates one-, two-, and three-supply-voltage designs for consideration by the circuit designer. The user has control over latency constraints, resource constraints, the number of control steps, the clock period, and the number of power supplies. MOVER iteratively searches for the range of minimum voltage levels. It uses an ILP to evaluate the feasibility of candidate supply voltage selections, to partition operations among different power supplies and to produce a minimum-area schedule under latency constraints once voltages have been selected. MOVER has the following phases:

• determining maximum and minimum bounds on the time frame in which each operation must execute
• searching for the minimum voltage
• partitioning datapath operations into two groups, one assigned to the higher supply voltage and one to the lower supply voltage
• partitioning the lower voltage group, for the three supply voltage schedule.

The MOVER algorithm [65] is similar to the MESVS algorithm [87] in the following ways:

• both use an ILP formulation
• both behave similarly with respect to latency, resource, and supply voltage constraints
• both use differential cascode voltage switch (DCVS) level converters.

The difference between MOVER and MESVS is that MESVS can only select from a discrete set of voltages, whereas MOVER can select from a continuous range of voltages. The ILP formulation handles timing and resource constraints and accounts for the cost if level shifters are used.
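The first MOVER phase, bounding the time frame of each operation, can be illustrated with the standard ASAP/ALAP computation. Whether MOVER derives its bounds in exactly this way is not stated here, so the sketch below is only an assumption-based illustration; the example graph, unit delays and latency bound are hypothetical.

```python
def asap(preds, delay):
    """Earliest start step of each operation (control steps are numbered from 1)."""
    earliest = {}

    def visit(node):
        if node not in earliest:
            # An operation can start once all of its predecessors have finished.
            earliest[node] = max(
                (visit(p) + delay[p] for p in preds[node]), default=1
            )
        return earliest[node]

    for node in preds:
        visit(node)
    return earliest


def alap(preds, delay, latency):
    """Latest start step of each operation under a latency bound (in control steps)."""
    succs = {node: [] for node in preds}
    for node, parents in preds.items():
        for p in parents:
            succs[p].append(node)

    latest = {}

    def visit(node):
        if node not in latest:
            # An operation must finish before its earliest-needed successor starts.
            latest[node] = (
                min((visit(s) for s in succs[node]), default=latency + 1)
                - delay[node]
            )
        return latest[node]

    for node in preds:
        visit(node)
    return latest


# Hypothetical DFG: (m1, m2) -> a1; (a1, m3) -> a2, all with unit delay.
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
delay = {node: 1 for node in preds}

earliest = asap(preds, delay)
latest = alap(preds, delay, latency=4)
time_frames = {node: (earliest[node], latest[node]) for node in preds}
```

The width of each time frame is the mobility of the operation; operations with a wider frame leave more room for running at a lower supply voltage with a longer delay.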
Ho we v er MO VER and MESVS ha v e follo wing dra wbacks :3it does not address conditional branches3does not consider functional pipelining3ener gy model used is dataintensi v e which ignores the ef fect of input acti vities on the ener gy dissipation of a module3it has e xponential w orstcase comple xity and can not handle lar ge benchmarks. 43 PAGE 61 Chang and Pedram [51 88 ] present a dynamic programming technique for multiple supply v oltage scheduling. The proposed technique handles both functionally pipelined and nonpipelined datapaths and multic ycling operations. The scheduling algorithm assigns a supply v oltage le v el from a x ed set of v oltage le v els such that the ener gy consumption is minimum for gi v en constraints. In this algorithm, the le v elshifters are used for both stepup and stepdo wn of signals. It may be noted that in most of the algorithms, le v elshifters are used for stepup of signals only An a v erage sa ving of"#7Sis obtained using three supply v oltage le v els as compared with single supply v oltage le v el. The algorithm has pseudopolynomial comple xity and produces optimal results for trees and produces suboptimal for general directed ac yclic graphs. The scheduling algorithm can handle v ery lar ge data o w graphs and the results are withinqerror In [89 ], an ILP formulation and a heuristic for v ariable v oltage scheduling is presented by Lin, Hw ang and W u. The authors ha v e considered three dif ferent solutions to the problem, such as time constrained, resource constrained, and timeandresource constrained. The scheduling schemes consider v ariable supply v oltage and multic ycling. The heuristic method produces results comparable with those of the ILP method in a fraction of runtime. The time comple xity of the heuristic algorithm is £ . 
The proposed heuristic is an modication o v er listbased algorithm with a priority function that considers three f actor such as the po wer gain of an operation, the mobility of an operation, and the computation density The authors sho w that using dif ferent cost and delay combinations, po wer consumption in a single design can dif fer by as much as a f actor ofwhen using mix ed. Z Z 9andA"90supply v oltages. Sarrafzadeh and Raje [90 ] proposed tw o scheduling algorithms; one is a dynamic programming algorithm and other is an heuristic algorithm based on geometric algorithm. The algorithms assume both time and resource constraints as inputs. The resource constraints is the number and type of each functional units and their operating supply v oltage. The algorithms assume only tw o supply v oltages, such asZ Z 9andA"9. The aim of the algorithms is to maximize the usage of the functional units at the lo wer supply v oltages while satisfying the time constraints. Letbe the number of nodes,be the time constraint,is gi v en resource constraint,is latenc y of a functional unit that run at a supply v oltage of9 ~. The running time of the dynamic programming 44 PAGE 62 T able 2.1. 
Datapath Scheduling Schemes using Multiple Supply V oltages Proposed Optimization Constraints Operating V oltage T ime Scheme Method Used Assumed Le v els Comple xity Johnson and ILP T ime ./A"9=A"90 Expoential Ro y [86 87 ] Johnson and ILP T ime ./A"9: Z Z 9:hA9Y0 Expoential Ro y [65 ] Chang and Dynamic T ime ./A"9: Z Z 9:hA90 PseudoPedram [51 88 ] Programming Polynomial Lin, Hw ang ILP and T ime and ./A"9: Z Z 9V0 Expoential and W u [89 ] Heuristic Resource Sarrafzadeh Dynamic Prog T ime and .A"9: Z Z 9V0 M7 C Ao C4 and Raje [90 ] Geometric Resource [.c$ $10 K umar and Stochastic Resource ./A"9: Z Z 9:hA9Y0 C Bayoumi [91 92 93 ] Ev olution Elgamel and Genetic T ime and ./A"9: Z Z 9:hA9Y0 N A Bayoumi [94 ] Algorithms Area Shiue and ListBased T ime and ./A"9: Z Z 9V0or Polynomial Chakrabarti [95 96 ] Resource ./A"9: Z Z 9:hA9Y0 Manzak and Lagrangian T ime and ./A"9: Z Z 9: C and Chakrabarti [97 ] Multiplier Resource A9:79V0 C Manzak and ListBased T ime and ./A"9: Z Z 9: C C Chakrabarti [98 ] Resource A9:79V0 scheduling algorithm is C Ao C. If$is the number of control steps, then the time comple xity of the geometric algorithm is.b$ q$^0and can handle more than tw o supply v oltages. The authors reported po wer reductions in the range of Z DRnd Z for v arious highle v el synthesis benchmarks under v arious resource and time constraints. K umar and Bayoumi [91 92 93 ] proposed scheduling schemes using multiple supply v oltages and multic ycling. The algorithms essentially has tw o phases, initialscheduling and rescheduling. During initial scheduling parallelism is e xploited and the rescheduling uses an iterati v e approach, which is based on stochastic e v olution. Le v elcon v erters are used when a functional unit operating at lo wer v oltage dri v es a functional unit operating at higher v oltage. The timecomple xity of the scheduling algorithm is C . The authors report po wer sa vings uptoR"for three supply 45 PAGE 63 v oltage le v els of.2A"9: Z Z 9andA90. 
The po wer o v erhead due to the le v elcon v erters is in the range"dand the area o v erhead is in the range"'d!. Elgamel and Bayoumi [94 ] use genetic algorithms to solv e multiple supply v oltage scheduling problem with multic ycling operations. The proposed scheme assumes unscheduled data or control o w graph, datapath component library area and time constraints as inputs and minimize a v erage po wer The algorithms simultaneously solv es scheduling, allocation and binding. Po wer reduction as high asRis reported. The results do not consider the po wer o v erhead due to the le v el con v erters. Shiue and Chakrabarti [95 96 ] discuss a resource constrained and a latenc y constrained listbased scheduling algorithms using multiple supply v oltages. The scheduling scheme consider the ef fect of switching acti vity The algorithms use heuristics to reduce po wer consumptions in the le v elcon v erters. The list based algorithms assign control steps to nodes based on their priorities. The priority of a node is a function of v arious parameters, such as depth, mobility switched capacitance, interconnection comple xity and need for a le v el shifter The le v el shifters are used between a lo wv oltage resource and a highv oltage resource for steppingup the signal. The proposed algorithms are of polynomial timecomple xity The proposed schemes achie v e signicant po wer reduction when the operation v oltages are./A"9andZ Z 9)0or.29: Z Z 9:andA9)0. The Lagrangian multiplier method has been used by Manzak and Chakrabarti [97 ] to de v elop resource and time constrained scheduling algorithms. The algorithms which use Lagrangian multiplier method in an iterati v e f ashion, are based on ef cient distrib ution of slack among the nodes in the DFG. Ifdenotes the number of nodes anddenotes the latenc y the time comple xity of the tw o v ersions of the proposed algorithms are C and C . The C algorithm results better sa vings in ener gy compared to the C algorithm. 
A v erage po wer or ener gy reduction ofZ Shas been obtained when the latenc y constraint istimes the critical delay and is impro v ed toDRAwhen the latenc y constraints relx ed totimes the critical path delay The time constraint, resource constraint consisting of the number of resource of each type operating at specic v oltage, delay and ener gy v alues are gi v en as inputs to the algorithm. The resources are 46 PAGE 64 allo wed to operate at one of supply v oltages from.A"9: Z Z 9:hA9:and90. The le v el shifters are used whene v er stepup of signal is necessary Manzak and Chakrabarti [98 ] proposed listbased latenc y and resource constrained scheduling algorithms. The scheduling uses priority function based on the number of a v ailable resources, the dif ference between the actual number of c ycles left and estimated number of c ycles required to schedule remaining nodes. The algorithms consider the switching acti vity of nodes. The resources are allo wed to operate at one of supply v oltages from.2A"9: Z Z 9:hA9:and9)0. The a v erage po wer or ener gy reduction isDSAqwhen the latenc y constraint istimes the critical delay and the a v erage po wer or ener gy reduction isDARwhen the latenc y constraint isA"times the critical delay The timecomple xity of the algorithm is C C , whereis the number of resources, andis the latenc y A comparati v e vie w of the abo v e discussed algorithms which use v oltage reduction for a v erage po wer or ener gy reduction is gi v en in T able 2.1. 2.2 Switching Acti vity Reduction During HighLe v el Synthesis In this section, we discuss the w orks on datapath scheduling which use capacitance reduction to reduce a v erage po wer or ener gy An o v ervie w of the discussed methods is gi v en in T able 2.2, where the percentage po wer reduction is the a v erage data. K umar Katk oori, Rader and V emuri [99 100 ] present a prole dri v en approach to highle v el synthesis called as Pr ole Driven Synthesis System (PDSS). 
The inputs to the PDSS are a subset of VHDL and constraints in terms of clock period and area. The PDSS generates a constraintsatisfying design with the least amount of estimated switching acti vity In this system, the input specication is proled to collect data for v arious operations and carriers using a user specied input set of v ectors. The switching acti vity for each module set is estimated by using this proled data and the ra w switching acti vity data of all modules in the library The module set with minimum estimate of po wer consumption is chosen for further synthesized. The goal of proling is to gather the follo wing data : 47 PAGE 65 3F or each node (operation), the number of times the node is e x ecuted for a gi v en proling stimuli is determined and input v ectors used as prole stimuli. This number is called the e v ent acti vity of the operation node.3F or each edge, the number of times the edge is tra v ersed during e x ecution is determined. This number is called the transaction acti vity of the edge.3F or each edge, the number of times the v alue on the edge has changed is determined. This number is called the e v ent acti vity of the edge. The authors claim that the results obtained are within an accurac y of4"of the actual switching acti vity measured at the switch le v el implementation of the design. Raghunathan and Jha [101 ] present a comprehensi v e lo wpo wer datapath synthesis system that performs the v arious highle v el synthesis tasks with the aim of reducing po wer consumption in the synthesized datapath. The authors call the system as SCALP The system considers both supply v oltage and switching capacitance to reduce the po wer consumption. The authors claim that SCALP estimates switching capacitance accurately handles di v erse module libraries and utilizes comple x scheduling constructs such as multic ycling, chaining, and structural pipelining. 
The input to SCALP is a control data flow graph (CDFG), the input sampling period, and a library of components to be used for datapath implementation. SCALP minimizes power consumption both by voltage scaling and by switching capacitance reduction. This is done by first pruning the set of candidate supply voltages to a small set. For each supply voltage in the pruned set, a datapath is synthesized that has minimal capacitance. The best solution among these datapaths in terms of power consumption is then chosen.

Raghunathan and Jha [102] are the first researchers to propose an allocation method for low power. The method is based on iterative improvement of some initial solution. The authors assume random input in a structurally pipelined design. The method can also handle non-random input sequences. The method is implemented in the framework of the Genesis behavioral synthesis system [103]. In this system, register and module allocations are performed simultaneously while minimizing the amount of interconnect needed. A lifetime analysis is performed for the scheduled CDFG. Two variables are said to be compatible and can share hardware resources if they are not alive at the same time. Similarly, two operations are compatible if they are not performed at the same time. Allocation is based on a weighted graph called the compatibility graph (CG). Initially, each variable and operation corresponds to a node in the CG, with undirected edges connecting compatible pairs. Weights are assigned to edges in the CG to indicate the preference for the two variables or operations to share the same resource. A single step of allocation selects the edge in the CG with the highest composite weight, merges the two nodes it joins, and maps the corresponding variables (or operations) to the same register (or module).
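The greedy merging step on the compatibility graph can be sketched as follows. This is a simplified illustration rather than the method of [102]: the rule that two merged groups are compatible only if every pair of their members is compatible, the summing of pairwise weights into a group weight, and the example weights are all assumptions.

```python
from itertools import combinations


def allocate(nodes, weights):
    """Greedily merge compatible nodes, always taking the heaviest edge first.

    nodes:   iterable of variable (or operation) names
    weights: dict mapping frozenset({u, v}) -> composite weight; a missing
             pair means that u and v are incompatible
    Returns a list of frozensets, one per shared register (or module).
    """
    groups = [frozenset([n]) for n in nodes]

    def group_weight(g, h):
        # Two groups can share a resource only if all member pairs are compatible.
        total = 0
        for u in g:
            for v in h:
                w = weights.get(frozenset((u, v)))
                if w is None:
                    return None
                total += w
        return total

    while True:
        best = None
        for g, h in combinations(groups, 2):
            w = group_weight(g, h)
            if w is not None and (best is None or w > best[0]):
                best = (w, g, h)
        if best is None:
            return groups  # no compatible pair is left
        _, g, h = best
        groups = [x for x in groups if x not in (g, h)] + [g | h]


# Hypothetical compatibility graph over four variables.
weights = {
    frozenset(("a", "b")): 5,
    frozenset(("c", "d")): 4,
    frozenset(("a", "c")): 3,
}
registers = allocate(["a", "b", "c", "d"], weights)
```

Ties are resolved here by iteration order, which is arbitrary; a tie-breaking rule based on a secondary weight could be added in the comparison.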
If two or more edges have the same composite weight, the tie is broken based on the corresponding transition activity weights (or, in some cases, arbitrarily). Power reduction is achieved with the help of two factors, capacitance and transition activity. Capacitance is reduced by minimizing the number of functional modules, registers and multiplexers. The allocation scheme selects a sequence of operations (variables) for a module (register) such that the transition activity is reduced.

Chiou, Muhammad and Roy [104] propose a scheduling and allocation method that reduces the power consumption of data-intensive applications by minimizing switching activity. The main idea of the synthesis technique is to reduce the signal strength difference among the inputs of shared resources. The signal strength is derived from word-level statistics. The authors have proposed a formula that relates switching power to resource sharing as follows.

$$\text{Switching increment} = \frac{\text{Difference in switching activity with and without sharing}}{\text{Switching activity without sharing}} \tag{2.1}$$

It is observed that sharing resources between two operations with high signal similarity will lower switching activity and hence reduce switching power. This observation serves as the major principle behind the proposed scheduling and allocation techniques. The proposed scheduling algorithm is heuristic based and uses a greedy approach in making module selections. Average power reduction is obtained using the proposed techniques compared to the conventional ones.

A comprehensive high-level synthesis system is proposed by Khouri, Lakshminarayana and Jha [105] to synthesize both control-flow intensive and data-intensive circuits. The system handles conventional synthesis tasks, such as scheduling, module selection, and resource sharing.
Moreo v er po wer conscious structuring of multiple x er netw orks, which are predominant in controlo w intensi v e circuits, is the k e y additional feature in the system. Experimental results demonstrate po wer reduction ofDfor controlo w intensi v e benchmarks as compared to9f PAGE 68 T able 2.2. HighLe v el Synthesis Schemes using Switching Acti vity Reduction Proposed Synthesis T asks Methods T ime % Po wer W ork Performed Used Comple xity Reduction K umar Katk oori, Rader Scheduling, Re gister Simulation N A N A and V emuri [99 100 ] Optimization, etc. of DFG Raghunathan T ranformation, ScheIterati v e Polynomial 4.6 and Jha [101 ] duling and Allocation Impro v ement Raghunathan Allocation Simulation N A 14.6 and Jha [102 ] Chiou, Muhammand Scheduling and Heuristic Polynomial 30.13 and Ro y [104 ] Allocation Based Khouri, LakshmiScheduling and Heuristic Polynomial 22 narayana and Jha [105 ] Resource Sharing Henning and Chakrabarti Scheduling and Intuti v e Polynomial 15 [106 107 ] Allocation Heuristic Shiue and Chakrabarti Resource Inte ger Linear Exponential 24.08 [108 ] Binding Programming Musoll and Cortadella Scheduling and ListBased C 6.67 [38 ] Resource Binding Algorithm Lundber g, Muhammad, N A Hierarchical N A 14.93 Ro y and W ilson [109 110 ] Shin and Lin Resource Heuristic Polynomial 7.84 [111 ] Allocation Monteiro, De v adas, Scheduling HYPER [112 ] N A 22.43 Ashar and Mauskar [113 ] Cherab uddi, Bayoumi P artitioning and Stochastic Polynomial 23.89 [114 ] Binding Ev olution Lee, Lee, P ark Scheduling Heuristic Polynomial 16.5 and Hw ang [115 ] Gupta and Scheduling F orceDirected 16.4 Katk oori [116 ] Heuristic Muruga v el and Scheduling Game Theory Exponential 13.9 Ranganathan [117 ] Binding 51 PAGE 69 reducing the transitions of their input operands. 
The po wer consumption of a functional unit is di vided into useful and useless po wer Useful po wer is consumed when an operation is e x ecuted and useless po wer is the consumption due to an input transition while the functional unit is idle. The algorithms proposed reduces both useful and useless po wer consumption. The scheduling algorithm is listbased in which the operation priority is set in such a w ay that operations sharing the same operand are scheduled in control steps as close as possible. F ornumber of operations andnumber of functional units, the running time of the proposed lo w po wer list scheduling (LPLS) is C . The algorithm for resourcebinding is based on clique partition that reduces po wer consumption by taking the a v erage Hamming distance (H) among the v ariables. F or tw o operandsand, if.:<0is the Hamming distance andmis the v alue of operandin c ycle, the a v erage Hamming distance is dened as follo ws.Pss.b0 )F'M) F (2.2) The a v erage Hamming distance is used as a measure of ener gy in/operation. Po wer reductions in the range ofdRha v e been reported. Lundber g, Muhammad, Ro y and W ilson [109 110 ] proposed switching acti vity models and use them to synthesize lo w po wer digital signal processing systems. The models can be easily inte grated in an y CAD tool. The accurac y of estimates obtained using the proposed models is reported to be within. Switching acti vity reductions upto"is obtained using the proposed approach. The models consider switching occuring at the output of functional units, b ut do not consider the capacitance dif ference due to the interconnect lengths. The bits of a signal are di vided into three re gions, such as lo w switching re gion, high switching re gion and the re gion in between. 
The lo w switching re gion consists of the most signicant bits (MSBs), the high switching re gion is the least signicant bits (LSBs) and the inbetween re gion is considered to be a linear transition connecting the other tw o re gions. Using these models, the output switching of basic b uilding blocks, such as onebit delay halfadder fulladder ha v e estimated. It is assumed that the number 52 PAGE 70 of internal transitions of a halfadder and a full adder is twice and thrice, respecti v ely more than that of an onebit delay Shin and Lin [111 ] propose an ef cient resource allocation algorithm that minimizes switching acti vity to reduce the dynamic po wer consumption of the DSP datapath. LetIbe a certain binary input sequence. Suppose,is the length ofIandis the number of 1s in the input sequenceI. The a v erage switching acti vity ofIis calculated as follo ws. w/um  y4mF n n n (2.3) F or e xample, for a input sequence"D4"D4"D"D"D", 4"and . The input to the allocation algorithm is a scheduled data o w graph. The algorithm e x ecutes all control steps, and compare functional unit with lo w po wer consuming re gister and interconnects of DSP circuits. The algorithm is of polynomial time comple xity Po wer reduction uptoRAreported using the algorithm. Shutdo wn techniques are used by Monteiro, De v adas, Ashar and Mauskar [113 ] to eliminate switching acti vity and hence po wer dissipation. The conditions under which the output of a module is not used for a particular c ycle is identied and the input latches for that module is disabled when the conditions are met. The proposed scheduling algorithm maximizes the shutdo wn period of functional units. The scheduling algorithm is time and resource constrained. The techniques, such as multiple xor reordering, pipelining are proposed to impro v e po wer management under these stringent contraints. The po wer reduction as high as6Dhas been reported. 
Cherabuddi and Bayoumi [114] propose partitioning and binding algorithms that minimize the switching activity of functional units and global buses for single-chip applications. Cherabuddi, Bayoumi and Krishnamurthy [118] extend the same work to multichip applications. The authors use a stochastic evolution based technique for partitioning. Power reduction of up to [...] has been reported. The switching activity is computed by iteratively changing the input data pattern, and a switching activity matrix is constructed. The partitioning algorithms partition the data flow graph such that each partition can be implemented in a different chip of a multichip module (MCM). The stochastic evolution approach is used in the partitioning algorithm for faster convergence. Scheduling and binding steps are performed for each move in the partitioning. An incompatibility graph is constructed from the original graph for resource allocation. To find optimal solutions for low-power binding, a multistage graph is formulated and a dynamic programming approach is used. The total switching activity of a schedule is calculated as the sum of the switching activity of the chips on the module and the switching activity on the inter-chip buses.

Lee, Lee, Park and Hwang [115] propose a scheduling algorithm that reduces the switching activity of the functional units under area or time constraints, thus reducing the power consumption. The switching activity is minimized by scheduling operations such that the Hamming distance between the variables appearing at the input and output ports is minimum. The functional unit allocation is performed by partitioning the operations in the given behavioral description while keeping the switching activity at a minimum. After allocation is performed, the scheduling algorithm attempts to schedule the operations using the minimum number of functional modules.
The algorithm has polynomial time complexity. The results indicate that a switching reduction of [...] on average can be obtained.

Gupta and Katkoori [116] propose a scheduling algorithm based on the original force-directed scheduling algorithm proposed in [24]. For a given data flow graph and input data environment, the DFG is profiled with representative data streams. The probability of selecting a combination of operations that would share a resource is evaluated. Assuming that the force equation is $F = kx$, the switching capacitance inside a module is modeled as the spring constant $k$, and the probability of selecting such a combination is modeled as the displacement $x$. The time complexity of the proposed algorithm, in terms of the number of possible time steps and the number of operations, is [...]. It may be noted that the original force-directed scheduling algorithm has a running time of [...]. The authors report a power reduction of [...] over the conventional force-directed algorithm.

Murugavel and Ranganathan [117] describe a game theory based algorithm for average power minimization during behavioral synthesis using low power binding. The techniques of functional unit sharing, path balancing, and register assignment are incorporated within the binding algorithm for power reduction. For the binding algorithm, each functional unit in the datapath is modeled as a player bidding for executing an operation, with the estimated power consumption as the bid.

Table 2.3. Relative Performance of Various Schemes Proposed for Peak Power Minimization
Proposed Work | Synthesis Tasks Performed | Methods Used | Time Complexity | % Power Reduction
Martin and Knight [41, 44] | Scheduling, Assignment | Genetic Algorithms | NA | 40.3-60.0
Shiue et al. [119, 120, 121, 108] | Scheduling | ILP; Force Directed | Exponential; [...] | [...]
Raghunathan et al. [47] | Scheduling | Data Monitor Operations | NA | 17.42-32.46
The operations are assigned to the functional units such that the number of functional unit inputs that change is minimized, thus reducing switching activity. The proposed algorithm yields a power reduction improvement of [...] without any increase in area or delay overhead.

2.3 Datapath Scheduling for Peak Power Reduction

Few research works have appeared that address peak power minimization at the behavioral level. In this section, we briefly discuss those works and give an overview of their relative performance in Table 2.3.

Martin and Knight [44, 41] propose a scheme that combines SPICE simulations with a behavioral synthesis tool to estimate and optimize the peak power consumption of digital ASICs. SPICE is used to measure the power consumption accurately. The behavioral synthesis tool is used for simultaneous assignment and scheduling such that the power used in each clock cycle is minimum. Genetic algorithms are used in the behavioral synthesis tool for optimization. The authors claim that genetic algorithms have advantages over other conventional optimization tools since they never get stuck in local minima and do not need fine tuning. The proposed synthesis tool can minimize the following parameters.
• average power with area, delay and peak power constraints
• peak power with area, delay and average power constraints
• delay with area and peak or average power constraints
• area with delay, average and/or peak power constraints
• any combination of area and power as a weighted formula
The optimizer searches for the best combination of architecture and schedule while satisfying all given constraints. They report peak power reduction in the range of [...], which comes at the cost of a [...] penalty in average power. The work also considers a mixed supply voltage scenario ([...]).
It is reported that the time penalty is large if the circuit is operated at low voltage, but significant power reduction is achieved.

Shiue [119, 120], Shiue and Chakrabarti [108], and Shiue, Denison and Horak [121] propose different datapath scheduling schemes to minimize peak power at the behavioral level. In [108, 121, 120], integer linear programming formulations are proposed, whereas [119] also includes a modified force-directed scheduling algorithm. The running time of the proposed modified force-directed scheduling algorithm, in terms of the number of control steps and the number of nodes, is [...]. The scheduling schemes in [119] minimize peak power while satisfying a time constraint. The scheduling algorithms in [108, 121, 120] minimize both peak power and peak area while satisfying latency constraints. The simultaneous minimization is performed with the help of a multicost objective using user-defined weighting factors. The formulations consider multicycling, pipelining and single supply voltage design. Peak power reductions in the range of [...] have been reported after scheduling and pipelining. The reduction in peak area is also in the range of [...].

In [47], a high-level synthesis approach for transient power management is presented by Raghunathan, Ravi, Raghunathan, and Lakshminarayana. The power optimization covers the peak power and the peak power differential. The authors advocate a judicious choice of transient power metric to avoid area and performance overheads. The authors propose the use of data monitor operations for simultaneous reduction of peak power and peak power differential. The proposed scheduling algorithm takes constraints on power characteristics in addition to the conventional resource and time constraints. In this scheme, peak power reduction in the range of [...] has been obtained. The reduction in the peak power differential is in the range of [...].
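The two transient metrics discussed in this section, peak power and peak power differential, can be computed from a schedule as sketched below. This is only an illustration under simplifying assumptions: each operation takes one control step and draws a fixed, hypothetical amount of power, and dependencies between operations are ignored. None of the names or numbers come from the cited works.

```python
def power_profile(schedule, op_power):
    """Per-control-step power: sum the power of operations in each step.

    schedule maps operation name -> control step (0-based);
    op_power maps operation name -> power drawn while executing.
    """
    steps = max(schedule.values()) + 1
    profile = [0.0] * steps
    for op, step in schedule.items():
        profile[step] += op_power[op]
    return profile


def peak_power(profile):
    """Largest power drawn in any single control step."""
    return max(profile)


def peak_power_differential(profile):
    """Largest change in power between two consecutive control steps."""
    return max((abs(b - a) for a, b in zip(profile, profile[1:])), default=0.0)


# Hypothetical units: a multiplier draws 8.0, an adder draws 2.0.
POWER = {"m1": 8.0, "m2": 8.0, "a1": 2.0, "a2": 2.0}
packed = power_profile({"m1": 0, "m2": 0, "a1": 1, "a2": 2}, POWER)
spread = power_profile({"m1": 0, "m2": 1, "a1": 1, "a2": 2}, POWER)
```

Moving one multiplication to the next step lowers both metrics here, which is the kind of trade a peak-power-aware scheduler looks for within its time and resource constraints.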
2.4 Scheduling for Variable Voltage Processor

The variable voltage processor has special instructions for controlling power. The supply voltage and clock frequency can be changed at any time by instructions in the application programs or the operating system. Examples of such processors are Transmeta Crusoe, Itsy, Intel StrongARM, etc. The clock frequency is adjusted according to the supply voltage to guarantee correct operation (Figure 2.1). The four approaches to managing a variable voltage processor are as follows [122]: (1) hardware based (no information), (2) interval-based (load information only), (3) integrated schedulers (all operating system statistics), and (4) application-specific (complete knowledge). In this section, we discuss the scheduling algorithms proposed for variable voltage core-based systems under the assumption that the operating system has a voltage scheduler (as in case 3). We also discuss instruction scheduling for variable voltage processors, which assigns voltage and frequency at the compiler level. The variable voltage scheduling scheme may be either static (offline) or dynamic (online), but the instruction scheduling schemes are offline. The variable voltage or instruction scheduling schemes may be either preemptive or non-preemptive. It may be noted that variable voltage processors are also referred to as variable frequency processors. An overall view of the scheduling algorithms is given in Table 2.4.

Ishihara and Yasuura [123] propose a static voltage scheduling algorithm using integer linear programming formulations. The processor core can have a single supply voltage at each instant of time, which can be changed dynamically. The average switching capacitance per cycle of a task is calculated as follows.
$$C_j = \frac{1}{X_j} \sum_{k=1}^{X_j} \sum_{i=1}^{N} c_i \, s_{i,k} \qquad (2.4)$$

where $X_j$ is the number of execution cycles for task $j$, $N$ is the number of gates in the processor, $c_i$ is the load capacitance of gate $i$, and $s_{i,k}$ is the switching count of gate $i$ while the $k$-th cycle of task $j$ is executed.

Figure 2.1. Variable Voltage Processor Operation: Voltage vs. Frequency [122]

On the basis of the assumption that the processor can use only a small number of discretely variable voltages, the authors propose several theorems, some of which are given below.
• For a processor that can use a continuously variable voltage, a single voltage minimizes energy consumption while satisfying the time constraints.
• Voltage scheduling with at most two voltages minimizes energy consumption under any time constraint if a processor can use only a small number of discrete voltages.
The authors report energy reduction of up to 70% (Table 2.4). Various processors with minimum operating voltage [...] and maximum operating voltage [...] are used in the experiments. Okuma, Ishihara, and Yasuura [124, 125] propose both static and dynamic voltage scheduling in the above framework.

Hong, Potkonjak, and Srivastava [126] propose preemptive variable voltage scheduling for real-time tasks comprising both online and offline workloads. The scheduling scheme ensures that the deadlines are met. The variable voltage is generated using a DC-DC switching regulator. The authors point out that the time overhead for clock frequency stabilization is negligible. A periodic (offline) task is characterized by a triple consisting of its worst-case computation time at the highest voltage, its hard deadline, and its period. Similarly, a sporadic (online) task is characterized by its arrival time, its computation time at the highest voltage, and its hard deadline. The online scheduling algorithm is heuristic based and has [...] time complexity in the number of tasks.
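The two-voltage theorem above can be illustrated in terms of clock frequency, since the clock frequency follows the supply voltage. The sketch below is not the ILP formulation of [123]; it assumes that the two levels used are the discrete frequencies immediately below and above the ideal frequency (cycles divided by deadline), and it ignores switching overhead. The function name and the numbers are hypothetical.

```python
def two_level_split(freqs, cycles, deadline):
    """Split a task's cycles between at most two discrete clock levels.

    freqs: available frequencies in cycles per second;
    cycles: total cycles the task needs; deadline: seconds available.
    Returns a dict mapping frequency -> cycles to run at that frequency,
    chosen so the task finishes exactly at the deadline.
    """
    levels = sorted(freqs)
    ideal = cycles / deadline
    if ideal > levels[-1]:
        raise ValueError("deadline cannot be met even at the highest level")
    if ideal <= levels[0]:
        return {levels[0]: cycles}
    hi = min(f for f in levels if f >= ideal)
    lo = max(f for f in levels if f <= ideal)
    if hi == lo:
        return {hi: cycles}
    # Solve x/hi + (cycles - x)/lo = deadline for x, the cycles run at hi.
    hi_cycles = (cycles / lo - deadline) / (1 / lo - 1 / hi)
    return {lo: cycles - hi_cycles, hi: hi_cycles}


# 150 million cycles in 2 s: ideal is 75 MHz, between the 50 and 100 MHz levels.
plan = two_level_split([50e6, 100e6, 200e6], 150e6, 2.0)
elapsed = sum(c / f for f, c in plan.items())
```

The 200 MHz level is never used in this example, which matches the intuition that running faster than necessary at a higher voltage wastes energy.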
Two algorithms are proposed that can handle both online and offline tasks. The running time of the optimal algorithm is [...], expressed in terms of the total number of requests in each hyperperiod of the periodic tasks and the number of online tasks that have been accepted but not completed. The suboptimal heuristic algorithm has time complexity [...]. The heuristic-based schedulers use a priority task queue in which the tasks are ordered Earliest-Deadline-First (EDF). Power reduction of up to 20% (Table 2.4) is reported by the authors. In [127], Hong, Kirovski, Qu, Potkonjak, and Srivastava propose a non-preemptive scheduling heuristic for the same problem.

Mansour, Mansour, Hajj, and Shanbhag [128] propose time constrained and resource constrained instruction scheduling algorithms that consider the latencies of instructions for a variable voltage processor. The assumed RISC architecture has an integer unit and a floating point unit. The integer unit has a pipelined integer adder, multiplier and divider. Similarly, the floating point unit has a pipelined floating point adder, multiplier and divider. The operating voltages assumed are [...]. The architecture is also assumed to have load and store instructions for accessing memory. The proposed algorithm is a list-based heuristic. The algorithm uses a power gain metric defined at each node as

[...] (2.5)

where the terms are the power consumed by the node when scheduled at a given voltage and the maximum delay incurred by rescheduling the node. The node with the highest power gain is selected for rescheduling. The algorithm maintains a prologue of instructions preceding the node and an epilogue of instructions following it in a data flow graph constructed for an instruction set. The time complexity of the algorithm is [...]. Power savings of up to [...] have been reported using this technique.

Table 2.4. Scheduling Algorithms for Variable Voltage Processor
Proposed Work | Working Level | Static or Dynamic | Method Used | Running Time | % Power Savings
Ishihara and Yasuura [123] | OS | Static | ILP | Exponential | 70
Okuma, Ishihara, and Yasuura [124, 125] | OS | Static | ILP | Exponential | 56
Okuma, Ishihara, and Yasuura [124, 125] | OS | Dynamic | Heuristic | NA | 58
Hong, Potkonjak, and Srivastava [126] | OS | Dynamic | Heuristic | [...] | 20
Hong, Kirovski, et al. [127] | System | Static | Heuristic | [...] | 25
Mansour, Mansour, et al. [128] | Circuit and Behavioral | Static | List-based Heuristic | [...] | 56
Azevedo, Issenin, and Cornea [129, 130] | Compiler | Static | Heuristic | NA | 82
Swaminathan and Chakrabarty [131] | OS | Dynamic | ILP | Exponential | 15
Swaminathan and Chakrabarty [131] | OS | Dynamic | Heuristic | NA | NA
Swaminathan and Chakrabarty [132] | OS | Dynamic | Pruning | Polynomial | NA
Hsu, Kremer and Hsiao [133, 134] | Compiler | Static | Heuristic | NA | 70
Pering, Burd and Brodersen [58] | OS | Static | Heuristic | [...] | 80
Lee and Krishna [135] | OS | Static | Heuristic | [...] | 54.5
Lee and Krishna [135] | OS | Dynamic | Heuristic | NA | 65.6
Pouwelse, Langendoen, and Sips [52] | OS | Dynamic | Heuristic | [...] | 50
Yao, Demers, and Shenker [136] | OS and Circuit | Static | Heuristic | [...] | NA
Yao, Demers, and Shenker [136] | OS and Circuit | Dynamic | NA | NA | NA
Luo and Jha [137] | OS | [...] | Heuristic | NA | 50
Luo and Jha [138] | OS | Static, Dynamic | [...] | Polynomial | NA

In [129, 130], Azevedo, Issenin and Cornea propose a dynamic voltage scaling technique that works at the compiler level instead of the operating system level. Checkpoints are introduced at compilation time to indicate places in the code where the processor speed and voltage should be recalculated. Two heuristic based algorithms are proposed. One heuristic results in an energy reduction of 82% compared to program execution without DVS. The proposed heuristic algorithms are power and time constrained and are divided into two major phases: an ahead-of-time profiling phase and a runtime power scheduling phase.
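Several of the operating-system-level schedulers surveyed above keep their ready tasks in a priority queue ordered Earliest-Deadline-First. A minimal sketch of such a queue, using Python's standard `heapq` module, is given below. The class name and its methods are hypothetical and are not taken from any of the cited works.

```python
import heapq


class EDFQueue:
    """Priority task queue that always yields the earliest deadline first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # insertion counter: breaks ties in arrival order

    def push(self, name, deadline):
        heapq.heappush(self._heap, (deadline, self._count, name))
        self._count += 1

    def pop(self):
        deadline, _, name = heapq.heappop(self._heap)
        return name, deadline

    def __len__(self):
        return len(self._heap)


queue = EDFQueue()
queue.push("video_frame", 40)
queue.push("audio_block", 10)
queue.push("ui_event", 25)
order = [queue.pop()[0] for _ in range(len(queue))]
```

Both insertion and removal take time logarithmic in the number of queued tasks, so the queue itself is rarely the dominant cost of such a scheduler.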
The four different clock frequency and voltage combinations supported are [...].

Online scheduling algorithms for periodic tasks are proposed in [131] by Swaminathan and Chakrabarty. The authors describe an integer linear programming (ILP) algorithm and a heuristic algorithm. The heuristic algorithm is based on the Earliest-Deadline-First (EDF) approach. The assumed CPU has two speeds, and the real-time tasks are non-preemptive. For example, for the two supply voltages [...] and [...], the operating frequencies are [...] and [...], respectively. The supply voltage to the CPU is controlled by the operating system, which may dynamically switch the voltage during runtime. The ILP based approach results in a power reduction of approximately [...] compared to the EDF method. In [132], the same authors propose a pruning based algorithm of polynomial time complexity, called energy-optimal device scheduler (EDS), in the same framework. The pruning is performed based on time and energy. Temporal pruning is done when a partial schedule results in missed deadlines.

Hsu, Kremer and Hsiao [133, 134] propose a compilation process that facilitates dynamic frequency and voltage scaling for energy reduction with marginal execution time overhead. It is a known fact that modern architectures exploit temporal and spatial locality. For programs (computations) with less temporal or spatial locality, the processor often stalls, waiting for the memory to provide data.
This leads to the principle behind this work, which uses a new compiler strategy to slow down a CPU that would otherwise stall or idle. The total program execution time $T$ is divided into three portions as given below.

$$T = \text{CPUBusy} + \text{MemoryBusy} + \text{BothBusy} \qquad (2.6)$$

If the CPU speed is reduced by a factor $\delta$, then the new execution time becomes

$$T_{new} = \delta \cdot \text{CPUBusy} + \max(\text{MemoryBusy} + \text{BothBusy},\ \delta \cdot \text{BothBusy}) \qquad (2.7)$$

In order to keep the new execution time very close to the original one, so that the time penalty is minimal, the following four conditions must be satisfied: (i) a condition on CPUBusy, [...], (ii) a condition relating MemoryBusy and BothBusy, [...], (iii) the memory latency is divisible by $\delta$, and (iv) $\delta$ has an integral value. The following compilation strategy has been proposed by the authors: (1) program regions are identified as scheduling candidates, (2) expected performance is modeled, which involves computation of CPUBusy, MemoryBusy, BothBusy and $\delta$, and (3) voltage / frequency scheduling instructions are generated for each scheduling candidate. The authors report energy reduction of [...] under the assumption of a Transmeta Crusoe processor.

Pering, Burd, and Brodersen [58] introduce a voltage scheduler as a part of the operating system. The scheduler determines the appropriate operating voltage by analyzing application constraints and requirements. The simulated lpARM processor is based on the ARM8 core and is designed to operate between [...] and [...], with operating frequency between [...] and [...]. An Earliest-Deadline-First (EDF) policy is used for temporal scheduling, which is optimal for fixed-speed systems. The voltage scheduler needs four kinds of hardware support: a speed-control register, a processor cycle counter, a wall-clock time and a system sleep control. The proposed scheduling algorithm assumes that all tasks are sporadic and calculates the minimum speed necessary to complete all tasks, assuming that they are all currently runnable.
This speed is calculated as

$$\text{speed} = \max_{i \le n} \frac{\sum_{j \le i} \text{work}_j}{\text{deadline}_i - \text{current time}} \qquad (2.8)$$

when the threads are sorted in EDF order. The algorithm has a running time of [...]. Energy reduction of up to 80% has been reported.

Both static and dynamic variable voltage scheduling algorithms are proposed by Lee and Krishna [135]. The processor is assumed to run at either high or low voltage, and correspondingly at high or low frequency. The first algorithm assigns each task to either the high-voltage fast-clock (H-mode) or the low-voltage slow-clock (L-mode) operation mode while meeting all deadline requirements. The dynamic scheduler, on the other hand, switches operational modes based on the accumulated processing workload. If a task completes before its deadline, the dynamic algorithm reclaims the unused processing time and makes less use of the high-voltage fast-clock mode. When the processor switches between the two modes, there is a switching interval during which the voltage regulator and the PLL clock generator complete the mode change, and the processor does not function during that interval. Assume that there is a set of tasks numbered in decreasing priority order. Each task is described by its worst-case execution time when the processor is running in L-mode, the deadline before which it must be completed, and the minimum time interval between two consecutive instances of the task. It may be noted that [...]. Given the relative processing speed of H-mode with respect to L-mode ([...]), the scheduling problem is to partition the tasks into two disjoint subsets such that [...] and [...] is minimized. The time complexity of the scheduler is [...], expressed in terms of the maximum and minimum of [...]. For static scheduling, average power savings in the range of [...] are obtained, and for dynamic scheduling, average power reduction in the range of [...].
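The minimum-speed rule of Equation 2.8 can be sketched as follows. This is an illustrative reading of the formula, not the lpARM scheduler itself: work is assumed to be expressed as the time a task would need at full speed, so that the returned value is a fraction of full speed, and all names and numbers are hypothetical.

```python
def minimum_speed(tasks, now):
    """Smallest constant speed that meets every deadline under EDF.

    tasks: list of (work, deadline) pairs, with work given in seconds of
    execution at full speed; now: the current time in seconds.
    For each task in deadline order, all work due by its deadline must
    fit into the time remaining until that deadline.
    """
    speed = 0.0
    cumulative_work = 0.0
    for work, deadline in sorted(tasks, key=lambda t: t[1]):
        if deadline <= now:
            raise ValueError("a deadline has already passed")
        cumulative_work += work
        speed = max(speed, cumulative_work / (deadline - now))
    return speed


# Three runnable tasks; the second deadline is the binding one (3 / 4).
required = minimum_speed([(2.0, 4.0), (1.0, 2.0), (3.0, 10.0)], now=0.0)
```

A value above 1.0 would mean that the task set cannot be completed on time even at full speed. After the sort, the loop is linear in the number of tasks.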
Pouwelse, Langendoen, and Sips [52] propose a heuristic called energy priority scheduling (EPS) that arranges the tasks according to their deadlines (in ascending order of priority). In this scheme, the low-priority tasks are scheduled first since they can be preempted to make room for the high-priority tasks. The energy priority scheduler is an online heuristic that follows an incremental approach and dynamically adjusts the clock schedule when new tasks arrive and old tasks complete or are preempted. The worst-case running time of the proposed heuristic is [...]. The algorithm is implemented as part of a complete system consisting of hardware, OS, clock scheduler and applications. The hardware is designed using a StrongARM SA1100 processor that supports clock speeds in the range [...]. Energy reduction of up to 50% (Table 2.4) has been reported.

In [136], Yao, Demers and Shenker investigate various methods for reducing energy consumption, both at the circuit level and at the operating system level. The authors also propose an offline scheduling algorithm that executes each job between its arrival and its deadline such that, for a set of jobs, the energy consumption is minimum. An online algorithm has also been proposed. Assuming a set of jobs, for any job in the set, if [...] is the arrival time, [...]

2.5 Design and Synthesis for Low-Power or High-Performance Variable Voltage / Frequency / Latency and Multiple Voltage Based Systems

In this section, we discuss the research works in the current literature that deal with systems based on multiple supply voltages, variable voltages (frequency) or dynamic frequency clocking, designed for low power or high performance applications. An overview of the proposed works is given in Table 2.5. In the table, the percentage reduction in power is given for the low-power works, and the percentage improvement in performance is tabulated for the high-performance works.

Usami, Igarashi, et al.
[66, 68, 69] propose multiple supply voltage based techniques for low power media processor design. The method involves a combination of clustered voltage scaling and row-by-row optimization of the power supply. The number of level converters used in the design is minimized because of the clustered voltage scaling. At the same time, the clustered voltage scaling technique maximizes the number of low [...] [...] and [...] supply voltage and [...] main clock frequency. The power reduction obtained is [...] with an area overhead of [...]. Automated low-power techniques have been proposed in [68, 69] for the same design methodology. The power reduction in the clock tree is [...] as reported in [68]. A design technique combining a variable supply voltage scheme and the above clustered voltage scaling is proposed in [67]. Power reduction of [...] is obtained when the design methodology is applied to a video codec design.

Ranganathan, Vijaykrishnan, and Bhavanishankar [59, 60, 140, 141] introduce the concept of dynamic frequency clocking (DFC) and use it in designing high-performance image processing architectures. They propose a SIMD (single instruction multiple data) architecture for real-time image processing applications using dynamic frequency clocking. The VLSI chip developed using the proposed architecture was implemented using Cadence tools. The chip operates in the frequency range of [...]. The DFC scheme is more suitable for data flow intensive applications (such as DSP and image processing). The DFC scheme is a combination of three concepts: reconfigurable architecture, frequency synthesizer and clock dividing strategy. In the reconfigurable architecture, frequencies are switched as the circuit changes, while in the DFC scheme, frequency switching occurs based on the units being used.
In the clock divider strategy, each unit receives a separate clock operating at a different frequency, whereas in the DFC strategy the same clock switches dynamically. Different functional units can have different maximum operating frequencies; for example, the multiplier, RAM, logical unit and adder have maximum frequencies of [...], [...], [...] and [...], respectively. A dynamic clocking unit (DCU) interprets and decodes each instruction and drives the processing unit at a suitable frequency. For a master clock at [...], output frequencies such as [...], [...] and [...] are generated using the clock-divider strategy. The speedup obtained using dynamic frequency clocking is in the range of 1.79-3.0 times as compared to single frequency operation. The authors advocate the use of dynamic frequency clocking along with pipelining for further improvement of performance.

Krishna, Ranganathan, and Vijaykrishnan [142, 143] propose resource and time constrained energy efficient datapath scheduling for synthesis of circuits using dynamic frequency clocking and multiple supply voltages (DFMVS). The proposed scheduling scheme DFMVS has two main modules, dynamic freq sched and modify sched. The first module generates the initial schedule, in which the control steps are clocked at different frequencies. The second module is a schedule modifier that regroups the operations of the initial schedule such that multiple supply voltages can be used to reduce the energy consumption. The algorithm is a list-based heuristic that takes as input an unscheduled data flow graph, the number of resources with their operating frequencies, and the time constraint of the whole schedule. Experiments are conducted for three operating voltages ([...]). Results show that, using three supply voltages, an average energy saving of [...] is obtained compared to a uni-frequency clocking scheme with a single supply voltage.

Table 2.5. Design and Synthesis Works on Variable Frequency or Multiple Frequency
Proposed Work | Design or Synthesis | Power or Performance | Operation Mode | Voltage or Frequency | Result
Usami, Igarashi, et al. [66, 68] | Design, Synthesis | Low-Power | Multiple Voltage | [...] | [...] (max)
Usami, Igarashi, et al. [67] | Design | Low-Power | Variable Voltage | NA | [...] (max)
Ranganathan et al. [59, 60] | Design | High Performance | Dynamic Frequency | [...] | 1.79-3.0 (times)
Krishna et al. [142, 143] | Synthesis (Scheduling) | Low-Power | Dynamic Frequency | [...] | [...]
Papachristou et al. [144] | Synthesis (Allocation) | Low-Power | Multiple Frequency | NA | [...] (max)
Burd, Brodersen, et al. [145, 146] | Design | Low-Power | Variable Voltage | [...] | [...] (avg)
Kim and Chae [63] | Design | Low-Power | Frequency Scaling | NA | NA
Pouwelse et al. [122] | Design | Low-Power | Variable Frequency | [...] | NA
Acquaviva, Benini, and Ricco [147] | Design | Low-Power | Variable Frequency | NA | [...] (max)
Benini et al. [148, 149] | Design, Synthesis | High Performance | Variable Latency | NA | [...]
Raghunathan et al. [150] | Synthesis | High Performance | Variable Latency | NA | [...]
Nowka et al. [151, 152] | Design | Low-Power | Frequency Scaling | [...] | NA
Lu, Benini, and Michelli [153] | Design | Low-Power | Frequency Scaling | [...] | [...] (max)

Papachristou, Nourani and Spining [144] propose a resource allocation technique for low-power design using multiple frequencies. The contribution of the paper is twofold. The first contribution is the use of non-overlapping multiple clocks to design a partitioned datapath, so that each partition is assigned a distinct clock. For $n$ partitions and a master clock frequency of $f$, the operating frequency of each partition is $f/n$. The inactive partitions are turned off during their off duty cycle to reduce power dissipation.
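The clock-dividing idea behind the dynamic clocking unit described above can be sketched as follows. This is only a toy model under stated assumptions: the master frequency, the per-unit maximum frequencies and the available divider values are hypothetical (the figures in the source are unreadable), each operation takes one clock cycle, and frequency switching costs nothing.

```python
def pick_divider(master_hz, unit_max_hz, dividers=(1, 2, 4, 8)):
    """Smallest divider whose output clock the unit can still keep up with."""
    for d in sorted(dividers):
        if master_hz / d <= unit_max_hz:
            return d
    raise ValueError("no divider is slow enough for this unit")


def dfc_time(ops, unit_max_hz, master_hz):
    """Total time when every cycle runs at the frequency suited to its unit."""
    return sum(pick_divider(master_hz, unit_max_hz[op]) / master_hz for op in ops)


def single_clock_time(ops, unit_max_hz, master_hz):
    """Total time when one fixed clock must suit the slowest unit used."""
    slowest = max(pick_divider(master_hz, unit_max_hz[op]) for op in set(ops))
    return len(ops) * slowest / master_hz


# Hypothetical values: a fast adder, a slow multiplier, a 400 MHz master clock.
UNIT_MAX_HZ = {"add": 400e6, "mul": 100e6}
MASTER_HZ = 400e6
OPS = ["add", "add", "mul", "add"]
speedup = single_clock_time(OPS, UNIT_MAX_HZ, MASTER_HZ) / dfc_time(
    OPS, UNIT_MAX_HZ, MASTER_HZ
)
```

With these numbers only the multiplication pays for the slow clock, so the speedup grows with the share of operations that run on fast units.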
The other contribution is a multiple clock allocation algorithm for power reduction. Two allocation techniques are proposed. In the first scheme, called split-allocation, the DFG is partitioned based on clock assignments and then each partition is synthesized separately. The second allocation algorithm performs allocation in an integrated way, taking into account the clock assignment of the DFG nodes. The advantage of this algorithm is better sharing of resources. The advantage of the split-allocation technique, in turn, is its adaptability to any existing allocator. Experimental results show power reduction at the cost of an area penalty.

Burd, Brodersen, et al. [154, 145, 146, 155, 50] propose a variable voltage (frequency) based system for low-power and high-performance applications. The system consists of an ARM8 core, a [...] cache and a DC-DC regulator. The operating voltage of the system is in the range of [...] in [145] and [...] in [154]. The three components needed to implement dynamic voltage scaling in a general purpose processor are as follows: a microprocessor that can operate over a wide voltage range, an operating system that can vary the processor speed, and a regulation loop that can generate the voltage required at a particular speed. A new component that needs to be added to the operating system is the voltage scheduler. The voltage scheduler controls the processor speed by writing the desired clock frequency to a system control register. This register value is used in the voltage-frequency regulation loop. A ring oscillator, whose output frequency is a function of voltage, serves as the heart of the voltage regulator. The authors report an energy reduction of [...] for the MPEG benchmark and of [...] for the AUDIO benchmark. In [50], the authors introduce various modes of processor computation, such as fixed throughput mode, maximum throughput mode and burst throughput mode.
The three key principles of energy efficient circuit design proposed are as follows:
• High performance is energy efficient,
• Clock reduction is not energy efficient, and
• Faster operation can limit efficiency.

Kim and Chae [63] propose a VLSI architecture for an MPEG2 decoder using frequency scaling. The system clock is adjusted to the lowest possible frequency depending on the current workload. The data-dependent applications require less hardware and consume less power than the data-independent applications due to the use of frequency scaling. The system consists of four major components: clock controller, programmable clock generator, circuit status detector and synchronizer. The clock controller gets the current status from the system, compares it with the required status, and changes the clock frequency accordingly. The programmable clock generator takes its input from the clock controller and generates the appropriate frequency. The circuit status detector guarantees the operating margin of the circuit under the variable clock frequency. The synchronizer is used to synchronize the signals between flip-flops using different clocks.

Pouwelse, Langendoen, and Sips [122] propose a variable frequency and voltage based microprocessor system for energy reduction. The authors report that the energy consumption per instruction at low speed is [...] of the energy required at full speed. The major components of the developed system (called LART) include an Intel StrongARM 1100 [...] processor, [...] volatile memory, [...] non-volatile memory and a voltage regulator. A Linux 2.4.0 operating system kernel module is modified to change the clock frequency. The kernel module also adjusts the memory parameters that control the read / write cycles on the external bus. It should be noted that the external memory is not available during the frequency change. The minimum clock frequency at which the processor can operate is [...] at [...].
The authors have studied the performance of the overall system, the memory and the applications.

Acquaviva, Benini, and Riccò [147] describe a software-controlled approach for adaptively minimizing energy in embedded systems for real-time multimedia applications. The software controller dynamically adjusts the processor clock speed (supply voltage) to the frame rate requirements of the incoming multimedia stream. The targeted CPU is the Intel StrongARM 1100 processor, in which twelve frequency levels are available by programming a PLL. Multimedia stream processing algorithms take data streams as input. The input stream, which consists of frames, is processed in the CPU. Let $C$ denote the average switching capacitance, [...]

[...] time with telescopic units, and $P(f = 1)$ is the probability that $f$ is one. The following condition must be satisfied for throughput improvement: [...] (2.12)

Heuristic algorithms, such as BDD-based heuristics and sum-of-products (SOP) based heuristics, are proposed for the synthesis of telescopic units. The experiments conducted showed that the throughput improvement is obtained at the cost of an area penalty.

Benini, Micheli, Macii, Odasso, and Poncino [149] propose another automatic synthesis technique, formulated as a timed supersetting problem, for synthesizing telescopic units.

Raghunathan, Ravi, and Lakshminarayana [150] propose a high-level synthesis methodology for the synthesis of the variable latency units introduced in [148, 149]. The authors propose novel techniques to reduce the area penalty. The proposed algorithms use an iterative approach and synthesize the circuit under resource constraints. A performance improvement was obtained at the cost of a bounded area penalty. It has also been reported that the performance improvement is accompanied by power savings.

Nowka et al. [151, 152] discuss a system-on-a-chip processor using dynamic voltage and frequency scaling.
The voltage or frequency is adaptable to changes in performance demand and power consumption. The targeted processor is the fixed voltage IBM PowerPC 405 core. The chip operates over a range of supply voltages. An on-chip regulator, along with the PLL, helps in continuously operating the chip even when the supply voltage is modified. When the demand for resources is low, the active power consumption is reduced using dynamic voltage scaling, frequency scaling, and unit and register level functional clock gating. Both the voltage and the frequency of the processor are varied under software control, and both active and standby power are minimized. The processor can enter a low-leakage sleep state and a state-preserving deep-sleep state to minimize standby power consumption.

Lu, Benini, and Micheli [153] discuss the energy reduction of interactive systems for mixed workloads of multimedia applications using dynamic frequency (voltage) scaling. The proposed technique is software-based and works for processors that have only a finite number of frequencies. The main idea is to insert buffers such that a constant output rate can be maintained even though the input rate may be changing. The multimedia programs are divided into stages and data buffers are inserted between them. The data buffers support constant output rates, allow frequency scaling and shorten the response times of sporadic jobs. Data are processed and stored in the buffers when the processor runs at a higher frequency. Later, the processor runs at a lower frequency to reduce power, and data are taken from the buffers to maintain the same output rate. Before the buffers become empty, the processor begins to run at a higher frequency again. The authors construct frequency-assignment graphs.
Each vertex represents the current state of the buffers and the frequencies of the processor. An efficient graph-walk algorithm that assigns frequencies to reduce energy has been proposed. The time complexity of both algorithms is polynomial. The method reduces the power consumption of an MPEG program.

2.6 Hardware Based Digital Watermarking Systems

There are several image watermarking algorithms available in the current literature that are implemented in software. The watermarking schemes work in the spatial domain, the DCT domain and the wavelet domain. However, hardware based watermarking systems are quite few. In this section, we discuss the hardware based watermarking systems. A comparative view of the proposed watermarking chips is given in Table 2.6.

Strycker, Termont, Vandewege, Haitsma, Kalker, Maes and Depovere [158] propose a real-time watermarking scheme for television broadcast monitoring. They address the implementation of a real-time watermark embedder and detector on a Trimedia TM-1000 VLIW processor developed by Philips Semiconductors. The watermark is in the spatial domain. In the insertion procedure, pseudo-random numbers are added to the incoming video stream. The depth of watermark insertion depends on the luminance value of each frame. The watermark detection is based on the calculation of correlation values.

Mathai, Kundur and Sheikholeslami [159] present a hardware implementation of the same video watermarking algorithm. The authors state the process technology used for the chip, but do not provide any layout details for the proposed hardware and do not mention its power consumption or operating frequency.

Table 2.6. Watermarking Chips Proposed in Current Literature

Proposed Work          | Type of Watermark  | Target Object | Working Domain | Technology | Chip Area | Chip Power Consumption
Mathai et al. [159]    | Invisible, Robust  | Video         | Wavelet        | [...]      | NA        | NA
Tsai and Lu [160]      | Invisible, Robust  | Image         | DCT            | [...]      | [...]     | [...]
Garimella et al. [161] | Invisible, Fragile | Image         | Spatial        | [...]      | [...]     | [...]

A DCT domain invisible watermarking chip is presented by Tsai and Lu [160]. The watermarking system embeds a pseudo-random sequence of real numbers in a selected set of DCT coefficients. They also propose a JPEG architecture that incorporates the watermarking module. The watermark is extracted without resorting to the original image. The authors claim that the watermark is resistant to JPEG attacks up to a stated compression ratio. The watermark chip is implemented in a TSMC technology, and the authors report its die size, gate count, power consumption, operating frequency and supply voltage.

Garimella, Satyanarayan, Kumar, Murugesh and Niranjan [161] propose a watermarking VLSI architecture for invisible-fragile watermarking in the spatial domain. In this scheme, the differential error is encrypted and interleaved along the first sample. The watermark can be extracted by accumulating the consecutive LSBs of pixels and then decrypting. The extracted watermark is then compared with the original watermark for image authentication. The authors report the process technology, chip area, power consumption, supply voltage and critical path delay of the ASIC implementation.

2.7 This Dissertation

The synthesis techniques discussed in Sections 2.1 and 2.2 are based on a single clock frequency and consider multiple supply voltages, voltage scaling, capacitance reduction, and switching activity reduction to minimize total energy or average power, but not both at the same time. Further, these works have not considered dynamic frequency clocking or transient power reduction.
The works in Section 2.3 address only peak power issues and do not include energy minimization or transient power. It is evident from Section 2.4 and Section 2.5 that voltage or frequency scaling is an effective method for power reduction and performance improvement.

In this dissertation, we propose scheduling techniques to minimize total energy (or average power). We also propose scheduling techniques for peak power and transient power reduction. Behavioral synthesis frameworks are proposed for the simultaneous reduction of energy, average power, peak power and transient power. A new parameter called Cycle Power Function (CPF) is defined, which is an equally weighted sum of the normalized mean cycle power and the normalized mean cycle differential power. Minimizing this parameter using multiple supply voltages (MV), dynamic frequency clocking (DFC) and multicycling results in the reduction of both energy and transient power. Both ILP and heuristic based approaches have been investigated.

In Section 2.6, we discussed the few watermarking hardware systems available. In this dissertation, we introduce a few VLSI implementations of existing watermarking algorithms. We intend to use multiple supply voltages and variable frequency in the watermarking chip design.

CHAPTER 3 ENERGY MINIMIZATION

Dynamic frequency scaling has been explored at the CPU and system levels for power optimization. In this chapter, we discuss datapath scheduling algorithms that use multiple supply voltages and dynamic clocking in a coordinated manner in order to reduce energy and energy delay product [54, 55]. The strategy is to schedule high energy units, such as multipliers, at lower frequencies so that they can be operated at lower voltages to reduce energy consumption, and low energy units, such as adders, at higher frequencies to compensate for speed.
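To make this strategy concrete, the following minimal sketch assumes the usual CMOS dynamic energy model, in which the energy of an operation is proportional to the switched capacitance times the square of the supply voltage. The capacitance values and function names are illustrative assumptions, not figures from this work; the three voltage levels are the ones that appear in the examples of this chapter.

```python
# Illustrative sketch only: dynamic energy per operation under E = a * C * V^2.
# The capacitance numbers below are hypothetical placeholders, not measured data.

VOLTAGES = (2.4, 3.3, 5.0)                 # supply levels used in this chapter's examples
CAPACITANCE = {"mult": 10.0, "add": 1.0}   # assumed relative switched capacitances


def dynamic_energy(activity: float, capacitance: float, voltage: float) -> float:
    """Switching energy of one operation, in arbitrary units."""
    return activity * capacitance * voltage ** 2


def relative_energy(voltage: float, reference: float = 5.0) -> float:
    """Energy at `voltage` as a fraction of the energy at `reference`."""
    return (voltage / reference) ** 2


def saving(unit: str, low: float = 2.4, high: float = 5.0) -> float:
    """Energy saved per operation by moving `unit` from `high` to `low` volts."""
    cap = CAPACITANCE[unit]
    return dynamic_energy(1.0, cap, high) - dynamic_energy(1.0, cap, low)


if __name__ == "__main__":
    for v in VOLTAGES:
        print(f"{v} V: {relative_energy(v):.4f} of the 5.0 V energy")
    print("saving per multiplication:", saving("mult"))
    print("saving per addition:      ", saving("add"))
```

Under these assumptions the absolute saving grows with the switched capacitance, which is why lowering the voltage of the multipliers pays off more than lowering the voltage of the adders.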
The proposed heuristic based time and resource constrained algorithms have been applied to various high level synthesis benchmark circuits under different time and resource constraints.

This chapter is organized as follows. Section 3.1 discusses the target architecture model and frequency selection scheme. Sections 3.2 and 3.3 present the time constrained scheduling (TCDFC) and the resource constrained scheduling (RCDFC) algorithms, followed by results and conclusions.

3.1 Target Architecture and Datapath Specifications

The target architecture model assumed in the design of the scheduling schemes is shown in Fig. 3.1. All functional units have one register each and one multiplexor. Each functional unit feeds into a single register. The register and the multiplexor operate at the same voltage level as that of the functional unit. Level converters are used when a low-voltage functional unit drives a high-voltage functional unit [65, 95]. A controller decides which functional units are active in each control step, and those that are not active are disabled using the multiplexors. The controller has a storage unit to store the parameters $cfi_c$ obtained from the scheduling. The cycle frequency $f_c$ ($= f_{base}/cfi_c$) is generated dynamically and a functional unit operating at one of the supply voltages is activated.

Figure 3.1. Level Converters Needed for Stepping up Signal

The datapath is specified as a sequencing data flow graph (DFG) [21]. Each vertex of the DFG represents an operation and each edge represents a data flow (or dependency). The DFG does not support hierarchical entities. Conditional statements are handled using comparison operations. Since the dynamic frequency clocking scheme is useful only in the case of signal processing applications, we assume that such constructs do not appear in the directed acyclic DFG representation of datapaths.
Each vertex has attributes that specify the operation type, such as addition, subtraction, multiplication or null operations (NOPs). The delay of a control step depends on the delays of the functional unit and of the multiplexor and register pair, that is, on the delay of the register, the delay of the multiplexor and the delay of the functional unit [...]

Figure 3.2. HAL Differential Equation Solver (with ASAP labels)

[...] the operating frequencies of the adders and of the multipliers, each with its associated $cfi$ value. For example, with the $cfi$ values of Table 3.3, the frequencies generated from the base frequency fed to the DCU are one half, one quarter and one eighth of the base frequency. The clock frequency for a given control step is the minimum of the operating frequencies of all FUs active in that step.

3.2 Time Constrained Scheduling

The datapath is represented in the form of a data flow graph (DFG) constructed as a sequencing graph. Fig. 3.2 shows such a graph for the HAL benchmark. The inputs to the algorithm are an unscheduled data flow graph (UDFG), the scaled down operating frequencies, and the execution time constraint for the whole schedule. To obtain more energy savings while maintaining performance, the multipliers are operated at as low a frequency as possible and the adders at as high a frequency as possible. This objective can be achieved if adders/subtractors are not operated along with multipliers in the same duty cycle. In cases where they must be operated during the same cycle to meet the time constraint, the energy savings come from the multipliers only. Initially, TCDFC generates a schedule such that the low frequency operators are scheduled at earlier steps and the high frequency operators are scheduled at later steps.
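The reason for keeping adders and multipliers in separate cycles follows from the clocking rule stated in Section 3.1: a control step runs at the minimum of the operating frequencies of its active FUs. The sketch below illustrates this under the assumption that a cycle frequency is the base frequency divided by its cycle frequency index; the base frequency value, the dictionary layout and the function names are hypothetical.

```python
from typing import Dict

F_BASE = 32.0  # hypothetical base frequency in arbitrary units, not a value from the source


def frequency_from_cfi(f_base: float, cfi: int) -> float:
    """Cycle frequency, assumed to be the base frequency divided by the index."""
    return f_base / cfi


def cycle_frequency(active_fus: Dict[str, float]) -> float:
    """A control step is clocked at the slowest frequency among its active FUs."""
    if not active_fus:
        raise ValueError("a control step needs at least one active functional unit")
    return min(active_fus.values())


mult_f = frequency_from_cfi(F_BASE, 8)  # slow multiplier running at a low voltage
add_f = frequency_from_cfi(F_BASE, 1)   # fast adder

mixed = cycle_frequency({"mult": mult_f, "add": add_f})
adders_only = cycle_frequency({"add": add_f})

if __name__ == "__main__":
    print("mixed cycle frequency:      ", mixed)
    print("adders-only cycle frequency:", adders_only)
```

In a mixed cycle the adder is held back to the multiplier's frequency, so the speed that the adders are supposed to recover is lost; this is why the initial schedule avoids mixed cycles where it can.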
Later on, TCDFC modifies the schedule by moving operations from one step to another with the objective of meeting the time constraint. It then finds the appropriate clock cycle width and assigns the appropriate voltage.

Step 1 : Find an ASAP schedule for the sequencing UDFG.
Step 2 : Create a priority list of vertices using the ASAP schedule from Step 1.
Step 3 : Assign control steps to the operations such that a higher priority vertex is scheduled at an earlier time stamp, precedence is satisfied, and multiplications and ALU operations are not scheduled in the same cycle.
Step 4 : For the currently obtained schedule, find the cycles having only ALU operations, those with only multiplications, and those with both ALU operations and multiplications (mixed).
Step 5 : Create a priority list of clock cycles such that cycles with only ALU operations get higher priority than cycles with only multiplications or with mixed operations (cycles with only multiplications get higher priority than cycles with mixed operations).
Step 6 : Initialize the cycle frequency to the minimum operating frequency.
Step 7 : If the time constraint is not satisfied, assign the next higher frequency to the highest priority cycle, and repeat the step for the next higher priority cycle if necessary.
Step 8 : If any cycle has a multiplier operating at the highest frequency, then eliminate the cycle having the minimum number of ALU operations, adjust the schedule and go to Step 4.
Step 9 : Do the voltage assignment and determine the energy details.
Step 10 : Find the cycle frequency index for each cycle.

Figure 3.3. TCDFC Scheduling Algorithm Flow

3.2.1 Algorithm Flow

Fig. 3.3 shows the flow of the proposed TCDFC scheduling algorithm. In step 1, an ASAP schedule for the data flow graph (DFG) is determined.
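Step 1 can be sketched as a longest-path labelling over a topological order. The edge list below is an assumption: it follows the commonly used form of the HAL differential equation solver benchmark, with v0 as the source NOP and v12 as the sink NOP, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Assumed edges of the HAL benchmark DFG (v0 = source NOP, v12 = sink NOP).
HAL_EDGES: List[Tuple[int, int]] = [
    (0, 1), (0, 2), (0, 6), (0, 8), (0, 10),   # source feeds the first operations
    (1, 3), (2, 3), (3, 4), (4, 5),            # multiply chain followed by subtractions
    (6, 7), (7, 5),
    (8, 9), (10, 11),
    (5, 12), (9, 12), (11, 12),                # last operations feed the sink
]


def asap_schedule(num_vertices: int, edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Return the earliest control step of every vertex (source at step 0)."""
    succs: Dict[int, List[int]] = {v: [] for v in range(num_vertices)}
    indegree = [0] * num_vertices
    for u, w in edges:
        succs[u].append(w)
        indegree[w] += 1

    step = {v: 0 for v in range(num_vertices)}
    ready = deque(v for v in range(num_vertices) if indegree[v] == 0)
    while ready:
        u = ready.popleft()
        for w in succs[u]:
            # A vertex starts one step after its latest predecessor.
            step[w] = max(step[w], step[u] + 1)
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return step


asap = asap_schedule(13, HAL_EDGES)

if __name__ == "__main__":
    for v in sorted(asap):
        print(f"v{v}: step {asap[v]}")
```

With this edge list the labels agree with the ASAP labels of Fig. 3.2: the first operations land in step 1 and the sink in step 5.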
In step 2, the scheduler creates a priority list of the vertices such that all multiplications (i.e., low frequency operators) are grouped with higher priority than the ALU operations (i.e., high frequency operators, such as additions, subtractions and comparisons). Among the multiplication operations, higher priority is given to the operations with the smaller ASAP time stamp; the same is done for the group of ALU operations. In step 3, the vertices are time stamped such that no multiplication and ALU operations are scheduled to function concurrently. In addition, it is ensured that operation precedence is satisfied and that a higher priority vertex is scheduled at an earlier time stamp. In step 4, for the current schedule, the cycles are categorized as cycles having only ALU operations, cycles having only multiplications, and cycles having both ALU operations and multiplications (mixed operations). In step 5, a priority list of clock cycles is created such that cycles with only ALU operations get higher priority than cycles with only multiplications or mixed operations. The cycles with only multiplications get higher priority than the cycles with mixed operations. Further, among the cycles with only ALU (or multiplication) operations, higher priority is given to the cycle having the smaller number of ALU (or multiplication) operations. Similarly, among the cycles with mixed operations, higher priority is given to cycles having the smaller number of multiplications. In step 6, the initial cycle frequency is taken as the minimum operating frequency with the help of Table 3.3. In step 7, in order to fulfil the time constraint, the frequency of the highest priority cycle is increased using Table 3.3. If needed, the process is repeated for the next higher priority cycle. In step 8, if a cycle with a multiplication is found to operate at the highest frequency (and hence at the highest voltage), then the cycle having the minimum number of ALU operations is eliminated and the schedule is adjusted.
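The vertex ordering of step 2 can be sketched as a two-level sort: multiplications before ALU operations, and smaller ASAP time first within each group. The ASAP labels are those of Fig. 3.2; the operation types follow the usual HAL benchmark and, like the tie-break by vertex index and the function name, are assumptions of this sketch.

```python
from typing import Dict, List

# ASAP labels as shown in Fig. 3.2 (v0 = source NOP, v12 = sink NOP).
ASAP: Dict[int, int] = {0: 0, 1: 1, 2: 1, 6: 1, 8: 1, 10: 1,
                        3: 2, 7: 2, 9: 2, 11: 2, 4: 3, 5: 4, 12: 5}

# Assumed operation types of the HAL benchmark.
OP: Dict[int, str] = {0: "nop", 12: "nop",
                      1: "mul", 2: "mul", 3: "mul", 6: "mul", 7: "mul", 8: "mul",
                      4: "alu", 5: "alu", 9: "alu", 10: "alu", 11: "alu"}


def vertex_priority_list(asap: Dict[int, int], op: Dict[int, str],
                         source: int, sink: int) -> List[int]:
    """Order vertices: source, multiplications, ALU operations, sink."""
    def key(v: int):
        group = 0 if op[v] == "mul" else 1   # low-frequency operators first
        return (group, asap[v], v)           # then ASAP time, then index (assumed tie-break)

    inner = sorted((v for v in asap if v not in (source, sink)), key=key)
    return [source] + inner + [sink]


priority = vertex_priority_list(ASAP, OP, source=0, sink=12)

if __name__ == "__main__":
    print(" ".join(f"v{v}" for v in priority))
```

Under these assumptions the resulting order coincides with the vertex priority list of Table 3.4.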
In step 9, the voltage assignment is done and the energy estimates for the entire DFG are found. In step 10, the cycle frequency index for each cycle is found. The pseudocode for the algorithm is given in Fig. 3.4.

Table 3.1. List of Functions used in the TCDFC Algorithm
($|V|$: number of vertices, $|E|$: number of edges, $c$: number of control steps, $N_F$: number of frequency levels, $N_R$: number of resource types)

Function                  | Description                                                                                                   | Complexity
ASAPScheduler             | Determines the ASAP time of the vertices.                                                                     | $O(|V|+|E|)$
CreateVertexPriorityList  | Creates a priority list of vertices such that the vertex with lower operating frequency gets higher priority. | $O(|V|)$
TOP                       | Finds the first vertex from the priority list array.                                                          | $O(1)$
CheckFrequencyConstraint  | Checks the frequency constraint in a cycle.                                                                   | $O(1)$
Maximum                   | Finds the maximum value from an array.                                                                        | $O(c)$
CreateCyclePriorityList   | Constructs the cycle priority list in an array.                                                               | $O(c)$
FindMinimumFrequency      | Finds the minimum available frequency.                                                                        | $O(N_F)$
CalculateDelay            | Calculates the critical path delay.                                                                           | $O(c)$
FindNextHigherFrequency   | Finds the next higher available frequency.                                                                    | $O(N_F)$
FindCycleWithMinimumALU   | Finds the control step with the minimum number of ALU operations.                                             | $O(c N_R)$
AdjustPredecessor         | Adjusts the time stamp of the predecessor.                                                                    | $O(|V|)$
AdjustSuccessor           | Adjusts the time stamp of the successor.                                                                      | $O(|V|)$
UpdateCyclePriorityList   | Updates the array.                                                                                            | $O(c)$
VoltageAssignment         | Assigns a voltage to each vertex.                                                                             | $O(|V|)$
FindCycleFrequencyIndex   | Finds the cycle frequency indices of all cycles.                                                              | $O(c)$

Table 3.2. List of Variables and Data Structures used in the TCDFC Algorithm Description

Data Structure              | Description
ASAPSchedule                | An array used to store the ASAP time stamp of each vertex.
TCDFCSchedSteps             | An array used to store the TCDFC time stamp of each vertex.
ScheduledVertexList         | An array used to store the vertices already scheduled.
VertexPriorityList          | An array used to store the vertices in priority order.
CyclePriorityList           | An array used to store the control steps in priority order.
TCDFCNoOfSteps              | Total number of control steps of the TCDFC schedule.
CycleFrequencyList          | An array used to store the frequency of each cycle.
cycle, ControlStepIndicator | Temporary variables.

3.2.2 Pseudocode Description

The list of functions needed in the implementation of the algorithm is given in Table 3.1. Similarly, the data structures and identifiers used in the algorithm description are summarized in Table 3.2. The pseudocode of the algorithm is given in Fig. 3.4.

Table 3.3. TCDFC Frequency Selection (from left to right)

Frequency | $f_{base}/8$ | $f_{base}/4$ | $f_{base}/2$ | $f_{base}$
$cfi$     | 8            | 4            | 2            | 1

Table 3.4. Vertex Priority List

Vertex   | v0 | v1 | v2 | v6 | v8 | v3 | v7 | v10 | v9 | v11 | v4 | v5 | v12
Priority | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7   | 8  | 9   | 10 | 11 | 12

In line 01, the ASAP schedule for the UDFG is found. The procedure CreateVertexPriorityList creates the VertexPriorityList such that the vertex with the lower operating frequency gets the higher priority and is scheduled at an earlier control step than the lower priority vertices. Table 3.4 shows such a list obtained for the DFG given in Fig. 3.2. TCDFCSchedSteps (line 02) is a data structure that contains the clock cycle step for any vertex $v_i$. It is initialized to zero for the source vertex. ScheduledVertexList (line 02) is a data structure that maintains the list of vertices already scheduled, and is initialized to the source vertex.
The while loop (line 03) takes the highest priority vertex each time (line 04) and schedules it in an appropriate cycle, checking for a frequency constraint violation, provided all of its predecessors are already scheduled.

TCDFCAlgorithm(UDFG, Operating Frequency) {
(01) ASAPScheduler(UDFG); CreateVertexPriorityList(ASAPSchedule); cycle = 1;
(02) TCDFCSchedSteps[v0] = 0; ScheduledVertexList = {v0}; // source vertex scheduled
(03) while (VertexPriorityList != NULL) {
(04)     vi = TOP(VertexPriorityList);
(05)     if (vi not in ScheduledVertexList and AllPredecessors(vi) in ScheduledVertexList) {
(06)         if (CheckFrequencyConstraint(cycle)) then cycle = Maximum(TCDFCSchedSteps) + 1;
(07)         else schedule in current cycle;
(08)         TCDFCSchedSteps[vi] = cycle; VertexPriorityList = VertexPriorityList - vi;
(09)         ScheduledVertexList = ScheduledVertexList + vi;
         } // end if (05)
     } // end while (03)
(10) TCDFCNoOfSteps = Maximum(TCDFCSchedSteps);
(11) CreateCyclePriorityList(CurrentSchedule, TCDFCNoOfSteps);
(12) CycleFrequencyList = FindMinimumFrequency(Table 3.3);
(13) T = CalculateDelay(CycleFrequencyList); ControlStepIndicator = 1;
(14) while (ControlStepIndicator) {
(15)     while (T > TimeConstraint) {
(16)         ci = TOP(CyclePriorityList); CycleFrequencyList[ci] = FindNextHigherFrequency(Table 3.3);
(17)         T = CalculateDelay(CycleFrequencyList);
         } // end while (15)
(18)     if (no multiplier is operating at highest frequency) then ControlStepIndicator = 0;
(19)     else {
(20)         ci = FindCycleWithMinimumALU(for all cycles ci);
(21)         for each vi in ci do reduce time stamp of vi and adjust Predecessor and Successor;
(22)         CycleFrequencyList = FindMinimumFrequency(Table 3.3);
(23)         T = CalculateDelay(CycleFrequencyList); Update CyclePriorityList;
(24)     } // end else (19)
     } // end while (14)
(25) Do voltage assignment; Find cycle frequency index;
} // End Algorithm TCDFC

Figure 3.4. Pseudocode for TCDFC Scheduling Algorithm
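The frequency-raising loop in lines 12 to 17 of Fig. 3.4 can be sketched as follows. This is a simplified reading, not the author's implementation: it assumes that the period of a cycle is its cycle frequency index divided by the base frequency, that a cycle is raised level by level until it reaches the highest frequency before the next cycle in priority order is touched, and it omits the cycle elimination of lines 18 to 24. The base frequency, the time limit and the function names are hypothetical.

```python
from typing import Dict, List

CFI_LEVELS = [8, 4, 2, 1]  # Table 3.3 order: lowest to highest frequency


def schedule_delay(cfi: Dict[int, int], f_base: float) -> float:
    """Total delay, assuming each cycle lasts cfi / f_base time units."""
    return sum(index / f_base for index in cfi.values())


def raise_frequencies(priority: List[int], f_base: float,
                      time_limit: float) -> Dict[int, int]:
    """Start every cycle at the lowest frequency, then speed up cycles in
    priority order until the time limit is met or all levels are exhausted."""
    cfi = {cycle: CFI_LEVELS[0] for cycle in priority}
    for cycle in priority:
        level = 0
        while (schedule_delay(cfi, f_base) > time_limit
               and level < len(CFI_LEVELS) - 1):
            level += 1
            cfi[cycle] = CFI_LEVELS[level]   # next higher frequency
        if schedule_delay(cfi, f_base) <= time_limit:
            break
    return cfi


# Cycle priority order of Table 3.5 without the source and sink NOP cycles.
result = raise_frequencies([5, 4, 3, 2, 1], f_base=8.0, time_limit=3.0)

if __name__ == "__main__":
    print(result, "delay =", schedule_delay(result, 8.0))
```

Because the ALU-only cycles come first in the priority list, they absorb the speed-up and the later cycles keep their low frequency for as long as the time limit allows.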
The function CheckFrequencyConstraint (line 06) helps in checking the frequency constraint. This ensures that two vertices operating at different frequencies are not scheduled during the same cycle. TCDFCNoOfSteps (line 10) is the number of control steps of the schedule already generated. The procedure CreateCyclePriorityList (line 11) creates the CyclePriorityList, in which the higher priority cycles are assigned higher frequencies. Table 3.5 shows such a list obtained for the schedule generated using lines 01-09. The data structure CycleFrequencyList (line 12) is used to store the operating frequency of each cycle. Initially, each cycle is assigned the minimum frequency from Table 3.3, and the critical delay of the schedule is found (lines 12-13). While the time constraint is not satisfied, the appropriate clock cycle is assigned the next higher frequency with the help of CyclePriorityList, and the time constraint is checked again (lines 14-24). The algorithm terminates if no cycle has a multiplier operating at the highest frequency (line 18). Otherwise, the cycle having the minimum number of ALU operations is eliminated (line 20), CyclePriorityList is updated, and lines 14-24 are repeated. Table 3.6 shows an updated CyclePriorityList. Finally, proper voltage values are assigned to the vertices. The algorithm also calculates the energy of the schedule and finds the cycle frequency index using CycleFrequencyList. The final scheduled datapath is shown in Figs. 3.5(a), 3.5(b) and 3.5(c) for different time constraints.

Table 3.5. Cycle Priority List

Cycle    | c5 | c4 | c3 | c2 | c1 | c6 | c0
Priority | 0  | 1  | 2  | 3  | 4  | 5  | 6

Table 3.6. Cycle Priority List (updated)

Cycle    | c4 | c3 | c2 | c1 | c5 | c0
Priority | 0  | 1  | 2  | 3  | 4  | 5

3.2.3 Time Complexity

Let there be $|V|$ vertices and $|E|$ edges in the DFG. Suppose the number of control steps found from the ASAP scheduling is $c$.
Figure 3.5. Schedules Obtained for HAL Benchmark for Different Time Constraints using TCDFC. Panels (a), (b) and (c) correspond to three different time constraints and show the operating voltage (2.4 V, 3.3 V or 5.0 V) of each operation and the cycle frequency index (cfi) of each cycle.

Let $N_F$ denote the number of frequency levels and $N_R$ denote the number of resource types. Based on the time complexity of the different functions given in Table 3.1, we provide the following analysis of the worst-case running time of the TCDFC algorithm. The time taken by lines 01-02 is $O(|V|+|E|) + O(|V|)$. The running time of the code segment in lines 03-09 is $O(c\,|V|)$. Similarly, $O(c) + O(N_F)$ is the running time of the code segment in lines 10-13. Assuming the while loops are executed a constant number of times (independent of the input size $|V|$ or $|E|$), the time complexity of the code segment in lines 14-25 is $O(c\,N_R) + O(|V|) + O(N_F) + O(c)$. Without loss of generality, we can assume that $N_R$, $N_F$ and $c$ are upper bounded by the number of vertices $|V|$. Using this assumption, the overall running time of the algorithm is expressed as $O(|V|+|E|) + O(|V|^2)$. For strong data dependency we have $|E| \approx |V|^2$, and for weak data dependency $|E| \ll |V|^2$.
In either case, the simplified time complexity of the TCDFC scheduling algorithm is $O(|V|^2)$, meaning that the time complexity is polynomial in the number of vertices (operations) in the data flow graph.

3.3 Resource Constrained Scheduling

The objective of RCDFC is to minimize the energy-delay-product while assigning a schedule for the DFG. For a resource $i$ operating in clock step $c$, let (i) $\alpha_{i,c}$ be the switching activity, (ii) $C_{i,c}$ be the load capacitance and (iii) $V_{i,c}$ be the operating voltage. If a level converter is needed, it is considered as a resource needed in the particular clock cycle in which it needs to step up the signal. If $N$ is the total number of clock cycles for the DFG, $R_c$ is the number of resources active in cycle $c$, and $f_c$ is the cycle frequency, then the total energy consumption of the DFG is given by Eqn. 3.2.

$$E = \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2} \qquad (3.2)$$

The energy-delay-product ($EDP$) is characterized by Eqn. 3.3.

$$EDP = E \times T = \left( \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2} \right) \left( \sum_{c=1}^{N} \frac{1}{f_c} \right) \qquad (3.3)$$

The objective of RCDFC is to minimize the $EDP$ given in Eqn. 3.3. RCDFC attempts to operate the multipliers at as low a frequency as possible; the resulting decrease in performance is compensated by operating the ALUs at as high a frequency as possible. Depending on which functional units are active in a given cycle, the algorithm determines the frequency using a lookup table (LUT), called the frequency selection LUT, such as the one shown in Table 3.7, scanning it from left to right.

Table 3.7. Frequency Selection (From Left to Right in Each Step)

FUs in a cycle | Frequency priority order
MULT           | $f^{MULT}_{min}$, $f^{MULT}_{mid}$, $f^{MULT}_{max}$
MULT, ALU      | $f^{MULT}_{min}$, $f^{ALU}_{min}$, $f^{MULT}_{max}$
ALU            | $f^{ALU}_{max}$, $f^{ALU}_{mid}$, $f^{ALU}_{min}$

Table 3.8. Resource Lookup Table (Order From Left to Right)

Clock Cycle | MULT 2.4 V | MULT 3.3 V | MULT 5.0 V | ALU 5.0 V | ALU 3.3 V | ALU 2.4 V
c           | 1          | 2          | 1          | 1         | 1         | 0

In a schedule, if only multipliers are needed in a particular cycle, the frequency selection is in the order $f^{MULT}_{min}$, $f^{MULT}_{mid}$, $f^{MULT}_{max}$, that is, from the lowest to the highest multiplier frequency.
If both multipliers and ALUs are operating in a given clock cycle, the frequency selection is in the order $f^{MULT}_{min}$, $f^{ALU}_{min}$, $f^{MULT}_{max}$. If only ALUs are operating in a control step, then the frequency selection is in the order $f^{ALU}_{max}$, $f^{ALU}_{mid}$, $f^{ALU}_{min}$, that is, from the highest to the lowest ALU frequency. Another lookup table, called the resource assignment LUT, constructed considering the resource constraints, is used to match the selected frequency with a corresponding voltage level. The resources are assigned by scanning the LUT from left to right. The scheduling algorithm uses heuristics to minimize the number of level conversions needed. An example resource assignment LUT is shown in Table 3.8 with the resource constraints: one MULT at 2.4 V, two MULTs at 3.3 V, one MULT at 5.0 V, one ALU at 3.3 V and one ALU at 5.0 V. The dimension of this LUT depends on the total number of clock cycles of the schedule and the number of resource types. It should be noted that the arrangement of the MULTs is in order from low to high voltage, whereas for the ALUs it is from high to low. The LUT is updated during each assignment to make sure that the resource constraints are not violated.

Step 1 : Derive ASAP and ALAP schedules for the unscheduled DFG.
Step 2 : Determine the number of resources at the different operating voltages.
Step 3 : Using the above number of resources, modify the schedules obtained in Step 1.
Step 4 : Calculate the total number of control steps, which is the larger of those of the ASAP and ALAP schedules from Step 3.
Step 5 : Construct the resource assignment LUT and the frequency selection LUT.
Step 6 : Find the vertices having nonzero mobility and the vertices with zero mobility, and take the ASAP schedule from Step 3 as the current schedule.
Step 7 : Do the voltage and frequency assignment using the current schedule and the LUTs.
Step 8 : Taking a vertex with nonzero mobility, time stamp it using the LUTs such that the energy delay product of the execution of the whole DFG is minimum.
Step 9 : Adjust the current schedule, the predecessor and successor time stamps and the LUTs, and repeat Steps 7 and 8 to time stamp the remaining nonzero mobility vertices.
Step 10 : Determine the clock frequency index for each cycle.

Figure 3.6. RCDFC Scheduling Algorithm Flow

3.3.1 Algorithm Flow

Fig. 3.6 shows the flow of the proposed algorithm. The data flow graph is modeled as a sequencing graph [21]. The inputs to the algorithm are an unscheduled data flow graph (UDFG) and the resource constraints, which include the number of resources, their corresponding operating voltages and the scaled down operating frequencies. In step 1, the scheduler determines the ASAP and the ALAP schedules for the UDFG. In step 2, the total number of resources is found as the sum of each resource at the different voltage levels. In step 3, the ASAP and ALAP schedules of step 1 are modified using the number of resources found in step 2. In step 4, the total number of control steps of both the ASAP and the ALAP schedule is found, and the number of control steps of the final schedule is assumed to be the maximum of the two. In step 5, the resource assignment LUT and the frequency selection LUT are constructed. In step 6, the vertices having nonzero mobility and the vertices with zero mobility are found, and the current schedule is initialized as the ASAP schedule obtained in step 3. In step 7, voltage and frequency assignments are made for the current schedule using the LUTs. In step 8, the scheduler finds a proper step for each vertex having nonzero mobility such that the number of level converters needed for the execution of the whole DFG is minimum. As long as the voltage and frequency assignments follow the order of the LUTs, energy consumption is kept to a minimum. In step 9, the current schedule and the LUTs are adjusted to satisfy the precedence. In step 10, the cycle frequency indices are found for all cycles; these would be stored in the controller and fed to the DCU for dynamic frequency generation. The algorithm terminates once all nonzero mobility vertices are scheduled.

3.3.2 Pseudocode of the Resource Constrained Algorithm

The list of functions needed in the implementation of the algorithm is given in Table 3.9. Similarly, the data structures and identifiers used in the algorithm description are summarized in Table 3.10. The pseudocode of the algorithm is given in Fig. 3.7.

Table 3.9. List of Functions used in the RCDFC Algorithm
($|V|$: number of vertices, $|E|$: number of edges, $c$: number of control steps, $N_F$: number of frequency levels, $N_R$: number of resource types)

Function                    | Description                                                                                         | Complexity
ASAPScheduler               | Determines the ASAP time of the vertices.                                                           | $O(|V|+|E|)$
ALAPScheduler               | Determines the ALAP time of the vertices.                                                           | $O(|V|+|E|)$
ModifySchedule              | Modifies the unconstrained schedules to incorporate resource constraints.                           | $O(|V|+|E|)$
ConstructResAssignmentTable | Constructs the resource assignment LUT.                                                             | $O(c\,N_R)$
Maximum                     | Finds the maximum of two control steps.                                                             | $O(1)$
FindResTypeForEachVertex    | Identifies the FU needed for each vertex.                                                           | $O(|V|)$
ConstructFreqSelectionLUT   | Constructs the frequency selection LUT.                                                             | $O(N_F)$
FindMobileVertexList        | Finds the mobility of each vertex.                                                                  | $O(|V|)$
AllocateVoltAndFreq         | Allocates the voltage and frequency levels using the LUTs and the current schedule.                 | $O(c\,|V|\,N_R)$
CalculateEDP                | Calculates the EDP of the whole DFG.                                                                | $O(|V|)$
AdjustSchedule              | Adjusts the predecessor and successor time stamps such that the precedence is satisfied.            | $O(|V|)$
UpdateResAssignmentLUT      | Updates the resource assignment LUT.                                                                | $O(1)$
FindEnergyAndDelay          | Determines the energy and delay.                                                                    | $O(|V|)$
FindCycleFreqIndex          | Finds the cycle frequency indices.                                                                  | $O(c)$

The inputs to the algorithm are the unscheduled data flow graph (UDFG) and the resource constraints, which include the number and type of each functional unit, the operating voltage levels and the operating frequencies.
The procedures in line 01, ASAPScheduler and ALAPScheduler, find the unconstrained ASAP and ALAP schedules for the UDFG, respectively. In line 02, the total number of multiplier and ALU FUs with different voltage levels is determined. For example, if the resource constraint is 2 ALUs at 2.4 V, 1 ALU at 3.3 V, 1 multiplier at 2.4 V, and 3 multipliers at 5.0 V, then the number of ALUs is 3 and the number of multipliers is 4. Using the number of multipliers and ALUs found above as the initial resource constraint (with relaxed voltage constraint), the ModifySchedule procedure (line 03) modifies the ASAP and ALAP schedules so that the resource constraints are not violated. In this process, the mobility of the vertices is restricted to a great extent, and the search space for the following steps is reduced.

RCDFCAlgorithm(UDFG, FUs, Voltage Levels, Operating Frequencies) {
(01) ASAPScheduler(UDFG); ALAPScheduler(UDFG);
(02) MULT = Multipliers of different voltage levels; ALU = ALUs of different voltage levels;
(03) ModifySchedule(ASAPSchedule, MULT, ALU); ModifySchedule(ALAPSchedule, MULT, ALU);
(04) NoOfControlSteps = Maximum(ASAPControlSteps, ALAPControlSteps);
(05) ConstructResAssignmentLUT(NoOfControlSteps, FUs);
(06) FindResTypeForEachVertex(UDFG); ConstructFreqSelectionLUT(Operating Frequency);
(07) FindMobileVertexList(ASAPSchedule, ALAPSchedule); CurrentSchedule = ASAPSchedule;
(08) while (NonZeroMobilityVertexList is NOT empty) {
(09)     max = -infinity; AllocateVoltAndFreq(CurrentSchedule, LUTs);
(10)     CurrentEDP = CalculateEDP(VoltageArray, FrequencyArray);
(11)     for each v_i in NonZeroMobilityVertexList {
(12)         start = CurrentSchedule[v_i]; end = ALAPSchedule[v_i];
(13)         for cycle = start to end in steps of 1 {
(14)             TempSchedule = AdjustSchedule(CurrentSchedule, v_i, cycle);
(15)             AllocateVoltAndFreq(TempSchedule, LUTs);
(16)             TempEDP = CalculateEDP(VoltageArray, FrequencyArray);
(16)             ExtraEDP = CurrentEDP - TempEDP;
(17)             if (ExtraEDP > max) {
(18)                 max = ExtraEDP; CurrentVertex = v_i;
(19)                 CurrentCycle = cycle; } // end if (17)
             } // end for (13)
         } // end for (11)
(20)     CurrentSchedule = AdjustSchedule(CurrentSchedule, CurrentVertex, CurrentCycle);
(21)     Update the resource assignment LUT;
(22)     ZeroMobilityVertexList = ZeroMobilityVertexList U CurrentVertex;
(23)     NonZeroMobilityVertexList = NonZeroMobilityVertexList - CurrentVertex; } // end while (08)
(24) AllocateVoltAndFreq(CurrentSchedule, LUTs);
(25) EnergyAndDelayDetails(VoltageArray, FrequencyArray); FindCycleFreqIndex(FrequencyArray);
} // End Algorithm RCDFC

Figure 3.7. Pseudocode for RCDFC Scheduler

Table 3.10. List of Variables and Data Structures used in the RCDFC Algorithm Description

Data Structure                            Description
ASAPSchedule                              An array used to store the ASAP time stamp of each vertex.
ALAPSchedule                              An array used to store the ALAP time stamp of each vertex.
CurrentSchedule                           An array used to store the current schedule time stamp.
TempSchedule                              An array used to store the temporary schedule time stamp.
MULT                                      Number of multipliers at all voltage levels.
ALU                                       Number of ALUs at all voltage levels.
ASAPControlSteps                          Total number of control steps of the ASAP schedule.
ALAPControlSteps                          Total number of control steps of the ALAP schedule.
NoOfControlSteps                          Number of control steps of the schedule.
ResAssignmentLUT                          Resource assignment lookup table.
FreqSelectionLUT                          Frequency selection lookup table.
max, start, end, cycle                    Temporary variables.
CurrentEDP, TempEDP, ExtraEDP             Temporary variables.
CurrentVertex, CurrentCycle               Temporary variables.
VoltageArray                              An array used to store the operating voltage for each vertex.
FrequencyArray                            An array used to store the operating frequency for each cycle.
ZeroMobilityVertexList                    An array storing the vertices with zero mobility.
NonZeroMobilityVertexList                 An array storing the vertices with nonzero mobility.
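As an illustration of how mobility follows from the ASAP and ALAP time stamps (lines 01 and 07), the following Python sketch computes both schedules for a small hypothetical DFG and splits its vertices into zero and nonzero mobility lists. It is only a sketch under stated assumptions: every operation takes one control step, resource constraints are ignored (the ModifySchedule step of line 03 is not modelled), and the function names and the example graph are illustrative, not taken from the RCDFC implementation.

```python
from collections import deque


def asap_schedule(succ):
    """ASAP step of every vertex, assuming unit latency (steps start at 1)."""
    n = len(succ)
    pending = [0] * n                      # number of unscheduled predecessors
    for u in range(n):
        for v in succ[u]:
            pending[v] += 1
    step = [1] * n
    ready = deque(u for u in range(n) if pending[u] == 0)
    while ready:
        u = ready.popleft()
        for v in succ[u]:
            step[v] = max(step[v], step[u] + 1)
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)
    return step


def alap_schedule(succ, total_steps):
    """ALAP step of every vertex: ASAP on the reversed graph, mirrored in time."""
    n = len(succ)
    pred = [[] for _ in range(n)]
    for u in range(n):
        for v in succ[u]:
            pred[v].append(u)
    reverse_asap = asap_schedule(pred)
    return [total_steps - r + 1 for r in reverse_asap]


def mobility_lists(asap, alap):
    """Split vertices into zero mobility and nonzero mobility lists."""
    zero = [v for v in range(len(asap)) if asap[v] == alap[v]]
    nonzero = [v for v in range(len(asap)) if asap[v] != alap[v]]
    return zero, nonzero


# Hypothetical DFG: vertices 0 and 1 feed 2; vertices 2 and 3 feed 4.
succ = [[2], [2], [4], [4], []]
asap = asap_schedule(succ)
alap = alap_schedule(succ, total_steps=max(asap))
zero_mobility, nonzero_mobility = mobility_lists(asap, alap)
print(asap, alap, zero_mobility, nonzero_mobility)
```

In this example only vertex 3 can move (between steps 1 and 2), so it is the only vertex the scheduler would have to time stamp.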
Next, the total number of cycles for the schedule is taken as the maximum of the number of cycles of the ASAP and ALAP schedules (line 04). The resource assignment LUT is constructed (similar to Table 3.8) in line 05; its size depends on (NoOfControlSteps x NoOfResourceTypes). The procedure FindResTypeForEachVertex (line 06) identifies the functional unit(s) required at each vertex of the DFG. Also in line 06, a frequency selection LUT similar to Table 3.7 is constructed. The FindMobileVertexList procedure (line 07) takes as input the modified ASAP and the modified ALAP schedules (line 03) to determine two lists: ZeroMobilityVertexList, containing the vertices with zero mobility (same ASAP and ALAP time stamps), and NonZeroMobilityVertexList, containing the nonzero mobility vertices (different ASAP and ALAP time stamps). In line 07, the CurrentSchedule is initialized as the modified ASAP schedule (obtained in line 03). The procedure AllocateVoltAndFreq (lines 09 and 24) allocates the voltage levels and frequency levels to the FUs using the LUTs and the current schedule. This procedure returns two lists: one containing the assigned voltage of each vertex (VoltageArray) and the other (FrequencyArray) containing the selected frequency. FrequencyArray is in turn used to derive the clock frequency index (cfi) for the control steps. The procedure CalculateEDP (line 10) calculates the energy delay product of the whole DFG using a schedule with the voltage assignment stored in VoltageArray and the frequencies contained in FrequencyArray. The procedure AdjustSchedule (lines 14 and 20) schedules each vertex to a specific cycle while adjusting its predecessor and successor time stamps. The for loop (lines 11 to 19) considers all the vertices from the NonZeroMobilityVertexList and finds a suitable vertex and its time stamp such that the energy delay product of the whole DFG with the current schedule is minimum.
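The greedy selection performed by the while and for loops (lines 08 to 23) can be sketched as follows. This is a simplified illustration, not the RCDFC implementation: AdjustSchedule is reduced to moving one vertex (predecessor and successor time stamps are not adjusted), the LUT-based voltage and frequency allocation is hidden behind a caller-supplied cost function `edp_of`, and the toy cost used in the example (the sum of squared operation counts per cycle) is a hypothetical stand-in for CalculateEDP.

```python
def greedy_time_stamp(current, alap, mobile, edp_of):
    """Fix one mobile vertex per pass at the step giving the largest EDP drop.

    current, alap : dict mapping vertex -> control step
    mobile        : iterable of vertices with nonzero mobility
    edp_of        : callable(schedule dict) -> cost (stand-in for CalculateEDP)
    """
    current = dict(current)
    mobile = set(mobile)
    while mobile:                                           # line 08
        best_gain = float("-inf")                           # line 09
        current_edp = edp_of(current)                       # line 10
        best_vertex, best_cycle = None, None
        for v in sorted(mobile):                            # line 11
            for cycle in range(current[v], alap[v] + 1):    # lines 12-13
                trial = dict(current)
                trial[v] = cycle                            # simplified AdjustSchedule
                gain = current_edp - edp_of(trial)          # ExtraEDP
                if gain > best_gain:                        # line 17
                    best_gain, best_vertex, best_cycle = gain, v, cycle
        current[best_vertex] = best_cycle                   # line 20
        mobile.remove(best_vertex)                          # lines 22-23
    return current


def toy_cost(schedule):
    """Hypothetical cost: penalises many operations sharing one cycle."""
    counts = {}
    for step in schedule.values():
        counts[step] = counts.get(step, 0) + 1
    return sum(n * n for n in counts.values())


start = {"a": 1, "b": 1, "c": 1}
latest = {"a": 1, "b": 2, "c": 3}
final = greedy_time_stamp(start, latest, mobile={"b", "c"}, edp_of=toy_cost)
print(final)
```

Each pass evaluates every remaining (vertex, cycle) pair, which is the source of the quadratic dependence on the number of mobile vertices discussed in Section 3.3.3.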
In line 21, the resource assignment LUT is updated. The while loop (lines 08 to 23) terminates when all the vertices with nonzero mobility have been assigned the proper time stamp. The procedure FindEnergyAndDelay (line 25) determines the energy consumption and execution time for the schedule. Also in line 25, FindCycleFreqIndex finds the cycle frequency indices of all cycles, which are used for dynamic frequency generation. Figure 3.8 is obtained after executing the RCDFC algorithm for the resource constraint (one MULT at 2.4 V, one MULT at 3.3 V, one ALU at 3.3 V and one ALU at 5.0 V).

Figure 3.8. Final Schedule of FIR Filter DFG (using RCDFC). The schedule spans cycles c = 0 to c = 12, from the source NOP (v0) to the sink NOP (v24); each cycle is annotated with its clock frequency index (cfi) and each operation with its operating voltage (2.4 V, 3.3 V or 5.0 V).

3.3.3 Time Complexity

Let there be $|V|$ vertices and $|E|$ edges in the DFG, of which $|V_m|$ vertices have nonzero mobility, and let the maximum mobility of any mobile vertex be $m$. Let $L_V$ denote the number of voltage levels and $L_f$ the number of frequency levels. Suppose the number of control steps found from the ASAP scheduling is $c$, and let $N_R$ be the number of resource types, which fixes the second dimension of the resource assignment LUT. Assuming that $L_V$ and $L_f$ are upper bounded by $|V|$, the running time of the code segment in lines 01-07 is $O(|V| + |E|) + O(c N_R)$. The time complexity of the instructions in lines 11-19 is $O(c |V| N_R |V_m| m)$. The code segment in lines 09 to 19 has running time $O(c |V| N_R |V_m| m) + O(|V|) + O(c |V| N_R) = O(c |V| N_R |V_m| m)$. The running time of the code segment in lines 08-19 is $O(c |V| N_R |V_m|^2 m)$. The time complexity of lines 20-25 is $O(|V|) + O(c |V| N_R) + O(c) = O(c |V| N_R)$. So, the running time of the overall algorithm is $O(|V| + |E|) + O(c N_R) + O(c |V| N_R |V_m|^2 m) + O(c |V| N_R) = O(|V| + |E|) + O(c |V| N_R |V_m|^2 m)$. Assuming that $|E|$ is upper bounded by $|V|^2$ and $|V_m|$ is upper bounded by $|V|$, the above expression can be simplified to $O(c |V|^3 N_R m)$.

3.4 Experimental Results

Both the RCDFC and TCDFC schedulers were implemented in C and tested with selected benchmark circuits. The benchmarks used are:
• Auto-Regressive (ARF) filter [162]
• Band-Pass filter (BPF) [27]
• Elliptic-Wave filter (EWF) [163]
• DCT [164]
• FIR filter [91]
• HAL differential equation solver [21].

The FUs used are ALUs and multipliers. The energy values are computed using the datapath components given in [54, 55]. The following notations are used to express the results:
• $E_S$ and $E_D$ are the total energy consumption for single supply voltage and multiple supply voltage operations, respectively.
• $EDP_S$ and $EDP_D$ are the energy delay products for single supply voltage and single frequency operation and for multiple supply voltage and dynamic clocking operation, respectively.
• $T_S$ and $T_D$ are the corresponding delays for the two modes of operation.
• $N_S$ denotes the number of clock steps of the schedule for single supply voltage and single frequency operation.
• $N_D$ is the equivalent number of clock steps, found by taking the delay of the slowest functional unit as the base clock width in the case of multiple voltage operation.
• The percentage energy savings is calculated as $\Delta E = \frac{E_S - E_D}{E_S} \times 100$. In a similar manner we calculated the percentage reduction in EDP, which is denoted as $\Delta EDP$.

For the RCDFC scheduler the experimental setup is as follows. The algorithm was tested using the different sets of resource constraints listed in Table 3.11. The experimental results for various benchmark circuits are reported in Table 3.12. The energy estimation includes the energy consumption of the overhead units.
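As a quick check of how the percentage columns are obtained, the short Python sketch below applies the savings formula to the ARF entries for resource constraint 1 reported in Table 3.12 (energies 36168 and 21768, energy delay products 20093 and 19954). The function name is illustrative only.

```python
def percent_reduction(single_mode, multi_mode):
    """Percentage reduction of a quantity relative to its single-voltage,
    single-frequency value: (S - D) / S * 100."""
    return (single_mode - multi_mode) / single_mode * 100.0


# ARF, resource constraint 1 (values from Table 3.12)
energy_saving = percent_reduction(36168, 21768)
edp_reduction = percent_reduction(20093, 19954)
print(round(energy_saving), round(edp_reduction))
```

Rounded to the nearest integer these give the 40% energy savings and 1% EDP reduction listed in the table for that row.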
It is assumed that each resource has equal switching activity. The results are reported for two supply voltages and for one fixed switching activity. It is observed that the energy consumption increases for higher switching activity and decreases for lower switching activity, but, under the assumption that the switching activity is the same for each resource, the percentage energy savings is not affected. We also conducted experiments with three supply voltage levels and found that the percentage energy savings increased only slightly.

Table 3.11. Resource Constraints used in our Experiments

   Multipliers         ALUs           Serial No.
  3.3 V   5.0 V     3.3 V   5.0 V       (RC)
    2       1         1       1           1
    3       0         1       1           2
    2       0         0       2           3
    1       1         0       2           4

Fig. 3.9(a) shows the percentage savings (average $\Delta E$) averaged over all resource constraints. From the chart it is evident that the scheduling yields approximately equal savings for all kinds of benchmark circuits. The EDP reduction (average $\Delta EDP$) averaged over all resource constraints is shown in Fig. 3.9(c). From the above, we may conclude that the scheduling algorithm yields appreciable energy savings and EDP reduction. In order to find the right combination of the types and the number of resources that will yield the best results in terms of energy reduction and high performance, we plotted energy consumption (%) versus time ratio ($T_D / T_S$). The right combination is nothing but the configuration corresponding to the maximum $\Delta EDP$. Based on this analysis, the processor configurations that yield the lowest execution time for each benchmark are listed in Table 3.13.

The TCDFC scheduler was tested for three different time constraints: 1.5, 1.75 and 2.0 times the critical path delay. The voltage constraint is relaxed, unlike in the RCDFC. The results for various benchmark circuits are reported in Table 3.14. Fig. 3.9(b) shows the chart indicating the energy savings for different benchmarks averaged over all time constraints.
Our observation is that circuits which require an equal number of ALU-related operations (addition, subtraction or comparison) and multiplier operations save more energy. The energy savings increased as the time constraint was relaxed from 1.5 to 2.0 times the critical path delay. The energy savings from the proposed RCDFC scheduling algorithm are listed along with those of other resource constrained multiple voltage scheduling algorithms in Table 3.15. The minimum and maximum range of energy savings are shown in the table. As clear from column (15) of Table 3.12, RCDFC gives better energy savings for lesser time penalties. The energy savings obtained using different existing multiple voltage based time constrained scheduling algorithms are shown in Table 3.16. In all cases, the time constraints are 1.5 to 2.0 times the critical path delay.

Table 3.12. Energy Details for Different Benchmarks using RCDFC Scheduler
(S: single supply voltage and single frequency; D: multiple supply voltages and dynamic frequency clocking; NA: not available)

                 Energy Estimates           Energy Delay Product           Time Estimates
Circuit   RC    E_S     E_D    dE (%)     EDP_S    EDP_D   dEDP (%)     N_S    T_S    T_D    N_D
(1) ARF    1   36168   21768     40       20093    19954       1         10    556    917      9
           2   36168   18205     50       20093    16688      17         10    556    917      9
           3   36168   19065     47       20093    18006      10         10    556    944      9
           4   36168   27617     24       26121    31452      NA         13    722   1139     10
          Average                40.3                          7.0
(2) BPF    1   27654   16491     40       13827    14659      NA          9    500    889      8
           2   27654   14175     49       13827    12600       9          9    500    889      8
           3   27654   14827     46       13827    12356      11          9    500    833      8
           4   27654   20172     27       26118    23253      11         17    944   1153     10
          Average                40.5                          7.8
(3) EWF    1   19404   10802     44       17248    12902      25         16    889   1194     11
           2   19404   10802     44       17248    12902      25         16    889   1194     11
           3   19404   10853     44       17248    11154      35         16    889   1028     10
           4   19404   11922     39       29106    17055      41         27   1500   1431     12
          Average                42.8                         31.5
(4) DCT    1   30675   17846     42       25547    26274      NA         15    833   1472     14
           2   30675   17846     42       25547    26274      NA         15    833   1472     14
           3   30675   18008     41       25548    25511       0         15    833   1416     13
           4   30675   18008     41       49392    37267      25         29   1611   2069     17
          Average                41.5                          6.3
(5) FIR    1   18678    9979     47       11414     6653      42         11    611    667      7
           2   18678    9979     47       11414     6653      42         11    611    667      7
           3   18678   10126     45       11414     6470      43         11    611    639      6
           4   18678   10127     46       15565    12096      22         15    833   1194     10
          Average                46.3                         37.3
(6) HAL    1   13596    8927     34        3021     2728      10          4    222    306      3
           2   13596    6433     53        3021     1966      35          4    222    306      3
           3   13596    6648     51        3021     2401      21          4    222    361      4
           4   13596   10211     25        3777     4396      NA          5    278    431      4
          Average                40.8                         16.5
Overall Average                  42.0                         17.7

Table 3.13. Configurations for Minimum EDP using RCDFC

Benchmark         Processor Configurations
Circuit        Multipliers          ALUs
             3.3 V    5.0 V     3.3 V    5.0 V
ARF            3        0         1        1
BPF            2        0         0        1
EWF            2        0         0        1
DCT            1        1         0        1
FIR            2        0         0        2
HAL            3        0         1        1

Table 3.14. Energy Savings using TCDFC Scheduler
(time constraint given as a multiple of the critical path delay)

Benchmark    Time          Energy consumption and savings
Circuit      Constraint      E_S       E_D     dE (%)
(1) ARF        1.5          36186     21491      41
               1.75         36186     18139      47
               2.0          36186     15274      58
             Average                             48.67
(2) BPF        1.5          27672     15187      45
               1.75         27672      9350      66
               2.0          27672      8249      70
             Average                             60.33
(3) EWF        1.5          19422     12335      36
               1.75         19422      8814      55
               2.0          19422      5341      73
             Average                             54.67
(4) DCT        1.5          30675     14611      52
               1.75         30675     14489      53
               2.0          30675      7714      75
             Average                             60.0
(5) FIR        1.5          18696      4910      74
               1.75         18696      4877      74
               2.0          18696      4820      74
             Average                             74.0
(6) HAL        1.5          13614      7808      43
               1.75         13614      6821      50
               2.0          13614      4449      67
             Average                             53.33
Overall Average                                  58.50

Figure 3.9. Average Energy and EDP Reduction for Benchmarks: (a) Energy Reduction for RCDFC, (b) Energy Reduction for TCDFC, (c) EDP Reduction for RCDFC. The horizontal axis of each panel lists the different benchmark circuits (1 to 6); the vertical axis is the average energy savings (%) in (a) and (b) and the EDP reduction (%) in (c).

3.5 Conclusions

Our aim is to use frequency scaling concepts for energy-efficient, high-performance special purpose processor (ASIC) design. The energy reduction is achieved by voltage reduction, and the performance is maintained by using DFC along with multiple voltages.
We developed resource constrained and time constrained datapath scheduling algorithms based on dynamic frequency clocking. The use of dynamic frequency clocking could generate enough slack to apply reduced voltages, which in turn saves energy. With two supply voltage levels, the RCDFC algorithm gives an overall average energy reduction of 42.0% for the benchmarks (Table 3.12), and only slightly more with three supply voltage levels. Similarly, for TCDFC the savings grow as the time constraint is relaxed from 1.5 to 2.0 times the critical path delay, with an overall average energy reduction of 58.50% (Table 3.14). The processor configurations for various benchmark circuits that would result in the minimum energy delay product were determined through experiments. The integration of such a scheduler into a low power datapath synthesis tool will significantly benefit low power processor design, especially for data intensive applications.

Table 3.15. Savings for Various Resource Constrained Schedulings
(each entry is a range of % energy savings followed by a range of time penalties in cycles; the RCDFC entry is listed first, the remaining entries are from Shiue [95], Sarrafzadeh [90] and Johnson [65])

ARF : 24-58, 9-10 ; 11-14, 11-16 ; 16-20, 17-24 ; 16-59, 10-18
BPF : 27-56, 8-10
EWF : 38-61, 10-13 ; 14-14, 17-20 ; 13-32, 21-25 ; 11-50, 12-24
DCT : 41-63, 13-18
FIR : 20-67, 6-10 ; 16-29, 10-15 ; 28-73, 5-10
HAL : 29-62, 2-3 ; 19-28, 5-6

Table 3.16. Savings for Various Time Constrained Schedulings
(each entry is a range of % energy savings; the TCDFC entry is listed first, the remaining entries are from Chang [51], Shiue [95] and Manzak [97])

AR   : 41-58 ; 40-63 ; 38-76 ; 25-61
BPF  : 45-70
EWF  : 36-73 ; 44-69 ; 13-76 ; 10-55
FDCT : 52-75 ; 43-69
FIR  : 74-74
HAL  : 43-67 ; 41-61 ; 22-77 ; 19-62

CHAPTER 4 ENERGY DELAY PRODUCT MINIMIZATION

In this chapter we describe an integer linear programming (ILP) based datapath scheduling algorithm which incorporates multiple supply voltages and dynamic frequency clocking (MVDFC) for energy reduction [64]. The scheduling technique assumes the number and type of different functional units as resource constraints and minimizes the energy delay product (EDP).
The energy savings come from the use of multiple supply voltages, while the performance improvement comes from dynamic frequency clocking. Further, we consider the simultaneous use of multiple supply voltages and multicycling (MVMC) to achieve a reduction in energy and energy delay product. Both the MVDFC and MVMC based schemes have been applied to various high level synthesis benchmark circuits under different resource constraints. The experimental results show appreciable reductions in both energy and energy delay product. This chapter is organized as follows. We first outline the related works proposed in the literature. Then we provide the ILP formulations to minimize the energy delay product. The next section discusses the ILP based scheduler, followed by experimental results.

4.1 Energy Delay Product of a Datapath Circuit

A CMOS circuit can be operated in different modes, namely: single supply voltage and single frequency; multiple supply voltages and single frequency; and multiple supply voltages and dynamic frequency. Traditionally, CMOS circuits are operated in the single supply voltage and single frequency mode, in which, during each cycle, the clock width is dictated by the slowest operator delay and each functional unit is operated at the same voltage level. In the multiple supply voltages and single frequency mode, different functional units are operated at different voltage levels to reduce energy consumption [65, 51, 89]. In this case, the energy consumption of the level converters is to be taken into account. More recently, the multiple supply voltages and dynamic frequency clocking mode of operation is being explored as a possible strategy for low power, high performance operation. In dynamic frequency clocking, the clock frequency is varied on-the-fly based on the functional units active in that cycle. In this scheme, all the units are clocked by a single clock line which switches at run time.
This scheme is particularly suitable for data intensive or compute intensive DSP applications. The architecture for dynamic clocking based systems consists of a datapath, a controller and a dynamic clocking unit (DCU). The datapath consists of functional units with registers and multiplexors. The controller decides which functional units are active in each control step, and those not active are disabled using a multiplexor. The DCU generates the required clock frequencies, which are submultiples of the base frequency, usually using a clock divider strategy [59, 62]. The base frequency is the maximum frequency (or a multiple of the maximum) of any functional unit at the maximum supply voltage. The controller has storage units to store a parameter called the clock frequency index ($cfi$) [55] for each control step, derived during the datapath scheduling. This clock frequency index serves as the clock dividing factor for the DCU. The cycle frequency is generated dynamically and the functional units with the appropriate supply voltages are activated. The main overheads in this scheme are the level converters, the dynamic clocking unit, and some additional storage in the control unit. When a value of $cfi_c$ is loaded into the DCU, the DCU provides a divided output clock frequency, $f_{base} / cfi_c$.

Let us assume that the datapath is represented as a sequencing data flow graph. We use the notations given in Table 4.1 for developing the following energy and energy delay product expressions for a datapath.

Table 4.1. Notations used in Description

$O$ : total number of operations in the DFG excluding the source and sink nodes (NO-OPs)
$o_i$ : any operation $i$ such that $1 \le i \le O$
$N$ : total number of control steps in the DFG
$c$ : any control step or clock cycle in the DFG
$R_c$ : number of resources active in step $c$
$f_c$ : cycle frequency for control step $c$
$\alpha_{i,c}$ : switching activity at resource $i$, used by operation $o_i$, operating in step $c$
$C_{i,c}$ : load capacitance of resource $i$, used by operation $o_i$, operating in control step $c$
$V_{i,c}$ : operating voltage of resource $i$, used by operation $o_i$, operating in control step $c$
$E_c$ : energy consumption of all functional units active in cycle $c$
$EDP_c$ : energy delay product of all functional units active in cycle $c$
$T$ : critical path delay of the DFG
$E$ : total energy consumption of the DFG
$EDP$ : total energy delay product of the DFG
$S$ : subscript used for single supply voltage and single frequency operation
$D$ : subscript used for multiple supply voltage and dynamic frequency operation
$M$ : subscript used for multiple supply voltage and multicycling operation
$f_{clk}$ : operating clock frequency for single frequency or multicycling operations

The energy consumption in any cycle $c$ is the energy consumption of all the resources active in $c$, which is given as,

$$E_c = \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \qquad (4.1)$$

The level converters are considered as resources operating in the control step in which they need to step up the signal. The total energy consumption of the whole DFG (or datapath) is the sum of the energy consumption for all cycles, as given in Eqn. 4.2 below.

$$E = \sum_{c=1}^{N} E_c = \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \qquad (4.2)$$

The dynamic clocking unit (DCU), which is responsible for generating the dynamic clock, is considered as a resource operating in all the control steps. The energy consumption of the DCU is to be added to Eqn. 4.2, but need not be considered for minimization. The critical path delay of the DFG is given by the summation of the inverses of the clock frequencies.

$$T = \sum_{c=1}^{N} \frac{1}{f_c} \qquad (4.3)$$

The total energy delay product can be calculated as the product of the total energy consumption and the critical path delay, as shown in the following equation.

$$EDP = \left( \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \right) \left( \sum_{c=1}^{N} \frac{1}{f_c} \right) \qquad (4.4)$$

This should be the objective function for the scheduling algorithm for minimization.
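The quantities in Eqns. 4.1 to 4.4 can be evaluated directly once a schedule is known. The Python sketch below does so for a hypothetical two-cycle schedule, assuming that the DCU output frequency of a cycle is the base frequency divided by that cycle's clock frequency index. All numerical values (switching activities, capacitances, voltages, base frequency) are made-up illustrative numbers in arbitrary units, and the function names are not from the dissertation.

```python
def cycle_frequency(f_base, cfi):
    """Clock frequency of one control step: base frequency divided by cfi."""
    return f_base / cfi


def cycle_energy(resources):
    """Eqn. 4.1: sum of alpha * C * V^2 over the resources active in a cycle."""
    return sum(alpha * cap * volt ** 2 for alpha, cap, volt in resources)


def total_energy_delay(schedule, f_base):
    """Eqns. 4.2-4.4 for a schedule given as a list of (resources, cfi) pairs.

    Returns (total energy, critical path delay, energy delay product).
    """
    energy = sum(cycle_energy(resources) for resources, _ in schedule)
    delay = sum(1.0 / cycle_frequency(f_base, cfi) for _, cfi in schedule)
    return energy, delay, energy * delay


# Hypothetical schedule: each resource is (alpha, C, V); second field is cfi.
schedule = [
    ([(0.5, 2.0, 2.0)], 2),                      # one resource, clock divided by 2
    ([(0.5, 2.0, 2.0), (0.5, 4.0, 1.0)], 4),     # two resources, clock divided by 4
]
energy, delay, edp = total_energy_delay(schedule, f_base=1.0)
print(energy, delay, edp)
```

Because the energy sum and the delay sum are multiplied, moving an operation to another cycle changes both factors at once, which is why this form of the objective is awkward to optimise directly.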
We are aiming at minimizing both the voltage and the frequency. Since the objective function involves the product of the two variables, and is a nonlinear function, we cannot use integer linear programming (ILP) for its minimization. Hence, instead of finding the energy consumption for each cycle $c$ as in Eqn. 4.1, we derive the energy delay product for each cycle.

$$EDP_c = \left( \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \right) \frac{1}{f_c} \qquad (4.5)$$

The total energy delay product of the DFG is the sum of the above $EDP_c$ for all control steps, which is given as follows.

$$EDP_D = \sum_{c=1}^{N} EDP_c = \sum_{c=1}^{N} \left( \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \right) \frac{1}{f_c} = \sum_{c=1}^{N} \sum_{i=1}^{R_c} \frac{\alpha_{i,c} \, C_{i,c} \, V_{i,c}^2}{f_c} \qquad (4.6)$$

For the single voltage and single frequency mode of operation, $V_{i,c}$ and $f_c$ are the same for any clock cycle ($c$) and any operation ($i$). However, for multiple supply voltage and multicycling operation, $f_c$ is the same for all control steps; let us denote it as $f_{clk}$. Following the same steps as above, the total energy delay product of the DFG for multiple supply voltage and multicycling operation is given by the following equation.

$$EDP_M = \sum_{c=1}^{N} EDP_c = \sum_{c=1}^{N} \left( \sum_{i=1}^{R_c} \alpha_{i,c} \, C_{i,c} \, V_{i,c}^2 \right) \frac{1}{f_{clk}} = \sum_{c=1}^{N} \sum_{i=1}^{R_c} \frac{\alpha_{i,c} \, C_{i,c} \, V_{i,c}^2}{f_{clk}} \qquad (4.7)$$

4.2 ILP Formulations

In this section, we discuss the ILP formulations to minimize the energy delay product of a datapath circuit. We first discuss the formulations for the multiple supply voltages and dynamic clocking based system, followed by the multiple supply voltages and multicycling based system. In order to formulate an ILP based model for the objective function and the scheduling scheme for the DFG, the notations given in Table 4.2 are required.

Table 4.2. Notations used in ILP Formulations

$F_{k,v}$ : functional unit of type $k$ operating at voltage level $v$
$M_{k,v}$ : maximum number of functional units of type $k$ operating at voltage level $v$
$S_i$ : as soon as possible (ASAP) time stamp for the operation $o_i$
$E_i$ : as late as possible (ALAP) time stamp for the operation $o_i$
$EDP(i, v, f)$ : energy delay product of the functional unit used by operation $o_i$ operating at voltage level $v$ and frequency $f$
$x_{i,c,v,f}$ : decision variable which takes the value 1 if operation $o_i$ is scheduled in control step $c$ using the functional unit $F_{k,v}$ and $c$ has frequency $f_c$
$y_{i,v,l,m}$ : decision variable which takes the value 1 if $o_i$ is using the functional unit $F_{k,v}$ and is scheduled in control steps $l \rightarrow m$
$L_{i,v}$ : latency for operation $o_i$ using a resource operating at voltage $v$ (in terms of number of clock cycles)

4.2.1 ILP Formulations : Dynamic Frequency Clocking

First, we derive the ILP formulation for the objective function given in Eqn. 4.6 for multiple supply voltages and dynamic frequency clocking.

Objective Function : The objective function minimizes the total energy delay product of the entire DFG. Using the decision variable $x_{i,c,v,f}$, we write the objective function as follows.

$$\text{Minimize: } \sum_{c} \sum_{i} \sum_{v} \sum_{f} x_{i,c,v,f} \; EDP(i, v, f) \qquad (4.8)$$

Uniqueness Constraints : These constraints ensure that each operation $o_i$ is scheduled to a unique control step within the mobility range ($S_i$, $E_i$) with a particular supply voltage and operating frequency. We represent them as,

$$\forall i, \; 1 \le i \le O: \quad \sum_{c=S_i}^{E_i} \sum_{v} \sum_{f} x_{i,c,v,f} = 1 \qquad (4.9)$$

Precedence Constraints : These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in earlier control steps and its successors are scheduled in later control steps. These are modelled as, ...

... over all control steps using multiple supply voltages and multicycling.

$$\text{Minimize: } \sum_{l} \sum_{i} \sum_{v} y_{i,v,l,(l+L_{i,v}-1)} \; EDP(i, v, f_{clk}) \qquad (4.12)$$

Uniqueness Constraints : These constraints ensure that each operation $o_i$ is scheduled in the appropriate control step within the mobility range ($S_i$, $E_i$), being assigned the specific supply voltage. An operation may sometimes need more than one clock cycle, depending on the supply voltage.
These constraints are represented as,

$$\forall i, \; 1 \le i \le O: \quad \sum_{v} \sum_{l} y_{i,v,l,(l+L_{i,v}-1)} = 1 \qquad (4.13)$$

When an operation is scheduled at the highest voltage, it is scheduled in one unique control step, whereas, when operations are operated at lower voltages, they need more than one clock cycle for completion. Thus, for lower voltages the mobility is restricted.

Precedence Constraints : These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in earlier control steps and its successors are scheduled in later control steps. These constraints should also take care of the multicycling operations. These are modeled as,

$$\forall i, j, \; o_i \in Pred(o_j): \quad \sum_{v} \sum_{l} (l + L_{i,v} - 1) \, y_{i,v,l,(l+L_{i,v}-1)} - \sum_{v} \sum_{l} l \, y_{j,v,l,(l+L_{j,v}-1)} \le -1 \qquad (4.14)$$

Resource Constraints : These constraints ensure that each control step contains no more than $M_{k,v}$ operations of type $k$ operating at voltage $v$. This can be enforced as,

$$\forall v \text{ and } \forall l, \; 1 \le l \le N: \quad \sum_{i \in F_{k,v}} \sum_{l} y_{i,v,l,(l+L_{i,v}-1)} \le M_{k,v} \qquad (4.15)$$

4.3 Datapath Scheduling Algorithm

In this section, we discuss the solution for the ILP formulations obtained in the previous section. The same target architecture and the same characterised datapath components used in [55] are assumed. The ILP based scheduler, which attempts to minimize the EDP, is outlined in Fig. 4.1. The first step is to determine the ASAP and ALAP time stamps of each operation. The ASAP time stamp is the start time and the ALAP time stamp is the finish time of each operation. These two times provide the mobility of an operation, and the operation must be scheduled within this mobility range. Then the scheduler constructs the ILP formulations based on the models described in Section 4.2. In step 6, the scheduler determines the cycle frequencies; the frequency of a cycle is the smallest of the frequencies of all operations scheduled in that cycle. Finally, we estimate the energy delay product and the energy consumption of the whole DFG.

Step 1 : Determine the ASAP and ALAP schedules of the UDFG.
Step 2 : Determine the mobility graph of each node.
Step 3 : Construct the ILP formulations for the DFG.
Step 4 : Solve the ILP formulations using LP-Solve.
Step 5 : Find the scheduled DFG.
Step 6 : Determine the cycle frequencies.
Step 7 : Find the energy and EDP estimates of the DFG.

Figure 4.1. ILP Based Scheduling for Low EDP

4.3.1 Scheduling for MVDFC

We illustrate the solution for the ILP formulation in the MVDFC case with the help of the DFG shown in Fig. 4.2. The ASAP schedule is shown in Fig. 4.2(a) and the ALAP schedule is shown in Fig. 4.2(b). From the ASAP and ALAP schedules, we obtain the mobility graph as in Fig. 4.2(c). Using this mobility graph, we have the ILP formulations shown in Fig. 4.3 for the resource constraint (RC2): three multipliers at 2.4 V, one ALU at 2.4 V, and one ALU operating at 3.3 V. We solved the formulations using LP-solve and, based on the results, we obtained the scheduled DFG shown in Fig. 4.2(d). In Fig. 4.3, we used the following additional notations: mult1 is the number of multipliers at voltage level 1, mult2 is the number of multipliers at voltage level 2, alu1 is the number of ALUs at voltage level 1, and alu2 is the number of ALUs at voltage level 2.

Figure 4.2. Example Data Flow Graph for Multiple Supply Voltages and Dynamic Frequency Clocking: (a) ASAP Schedule, (b) ALAP Schedule, (c) Mobility Graph, (d) Final Schedule. The graph has a source NOP (vertex 0), three multiplications, three additions and a sink NOP (vertex 7), scheduled over control steps c0 to c4; in the final schedule five operations run at 2.4 V and one at 3.3 V.

4.3.2 Scheduling for MVMC

We illustrate the solution for the ILP formulation of the MVMC case using the DFG shown in Fig. 4.4. The ASAP schedule is shown in Fig. 4.4(a) and the ALAP schedule is shown in Fig. 4.4(b). From the ASAP and ALAP schedules, we obtain the mobility graph shown in Fig. 4.4(c). It should be noted that this mobility graph is different from that shown in Fig. 4.2(c).
In the MVMC case, the mobility graph considers the multicycle operations. We assume two operating voltage levels, and when a multiplier is operated at the lower voltage level, it takes two clock cycles to complete the operation. For the characterised cells used in our experiment [55], the operating clock frequency is SD*+. Using this mobility graph, we have the ILP formulations shown in Fig. 4.5 for the resource constraint (RC2): three multipliers at 2.4 V, one ALU at 2.4 V, and one ALU operating at 3.3 V. We solved the formulation using LPSolve and, based on the results, obtained the scheduled DFG shown in Fig. 4.4(d). In Fig. 4.5, the notations Mmult1, Mmult2, Malu1 and Malu2 are the same as those used in the case of the MVDFC.

/* ILP Formulation for Energy Delay Product Minimization for MVDFC scheme */
/* Objective Function */
min: 106.6 x1111 + 213.2 x1112 + 56.4 x1121 + 112.8 x1122
   + 106.6 x1211 + 213.2 x1212 + 56.4 x1221 + 112.8 x1222
   + 106.6 x2111 + 213.2 x2112 + 56.4 x2121 + 112.8 x2122
   + 106.6 x3111 + 213.2 x3112 + 56.4 x3121 + 112.8 x3122
   + 106.6 x3211 + 213.2 x3212 + 56.4 x3221 + 112.8 x3222
   + 2.8 x4211 + 5.5 x4212 + 1.5 x4221 + 2.9 x4222
   + 2.8 x5211 + 5.5 x5212 + 1.5 x5221 + 2.9 x5222
   + 2.8 x5311 + 5.5 x5312 + 1.5 x5321 + 2.9 x5322
   + 2.8 x6311 + 5.5 x6312 + 1.5 x6321 + 2.9 x6322;
/* Uniqueness Constraints */
x1111 + x1112 + x1121 + x1122 + x1211 + x1212 + x1221 + x1222 = 1;
x2111 + x2112 + x2121 + x2122 = 1;
x3111 + x3112 + x3121 + x3122 + x3211 + x3212 + x3221 + x3222 = 1;
x4211 + x4212 + x4221 + x4222 = 1;
x5211 + x5212 + x5221 + x5222 + x5311 + x5312 + x5321 + x5322 = 1;
x6311 + x6312 + x6321 + x6322 = 1;
/* Precedence Constraints */
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - 2 x1211 - 2 x1212 - 2 x1221 - 2 x1222 - x1111 - x1112 - x1121 - x1122 >= 1;
2 x4211 + 2 x4212 + 2 x4221 + 2 x4222 - x2111 - x2112 - x2121 - x2122 >= 1;
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - x4211 - x4212 - x4221 - x4222 >= 1;
3 x5311 + 3 x5312 + 3 x5321 + 3 x5322 + 2 x5211 + 2 x5212 + 2 x5221 + 2 x5222 - 2 x3211 - 2 x3212 - 2 x3221 - 2 x3222 - x3111 - x3112 - x3121 - x3122 >= 1;
/* Resource Constraints */
x1111 + x2111 + x3111 + x1112 + x2112 + x3112 <= 0; /* mult1 */
x1121 + x2121 + x3121 + x1122 + x2122 + x3122 <= 3; /* mult2 */
x1211 + x3211 + x1212 + x3212 <= 0; /* mult1 */
x1221 + x3221 + x1222 + x3222 <= 3; /* mult2 */
x4211 + x5211 + x4212 + x5212 <= 1; /* alu1 */
x4221 + x5221 + x4222 + x5222 <= 1; /* alu2 */
x5311 + x6311 + x5312 + x6312 <= 1; /* alu1 */
x5321 + x6321 + x5322 + x6322 <= 1; /* alu2 */
/* Frequency Constraints */
x1121 = 0; x1221 = 0; x2121 = 0; x3121 = 0; x3221 = 0; x4221 = 0; x5221 = 0; x5321 = 0; x6321 = 0;
/* Zero-One Type Cast */
INT x1111, x1112, x1121, x1122, x1211, x1212, x1221, x1222, x2111, x2112, x2121, x2122, x3111, x3112, x3121, x3122, x3211, x3212, x3221, x3222, x4211, x4212, x4221, x4222, x5211, x5212, x5221, x5222, x5311, x5312, x5321, x5322, x6311, x6312, x6321, x6322;

Figure 4.3. ILP Formulation for Example DFG for Multiple Supply Voltages and Dynamic Frequency Clocking

Figure 4.4. Example DFG (for RC2) (MVMC): (a) ASAP Schedule, (b) ALAP Schedule, (c) Mobility Graph, (d) Final Schedule
/* ILP Formulation for Energy Delay Product Minimization for MVMC scheme */
/* Objective Function */
min: 106.6 x1111 + 106.6 x1122 + 106.6 x1133 + 56.4 x1212 + 56.4 x1223
   + 106.6 x2111 + 106.6 x2122 + 56.4 x2212
   + 106.6 x3111 + 106.6 x3122 + 106.6 x3133 + 56.4 x3212 + 56.4 x3223
   + 2.8 x4122 + 2.8 x4133 + 1.5 x4222 + 1.5 x4233
   + 2.8 x5122 + 2.8 x5133 + 2.8 x5144 + 1.5 x5222 + 1.5 x5233 + 1.5 x5244
   + 2.8 x6133 + 2.8 x6144 + 1.5 x6233 + 1.5 x6244;
/* Uniqueness Constraints */
x1111 + x1122 + x1133 + x1212 + x1223 = 1;
x2111 + x2122 + x2212 = 1;
x3111 + x3122 + x3133 + x3212 + x3223 = 1;
x4122 + x4133 + x4222 + x4233 = 1;
x5122 + x5133 + x5144 + x5222 + x5233 + x5244 = 1;
x6133 + x6144 + x6233 + x6244 = 1;
/* Resource Constraints */
x1111 + x2111 + x3111 <= 0; /* Mmult1 */
x1212 + x2212 + x3212 <= 3; /* Mmult2 */
x1122 + x2122 + x3122 <= 0; /* Mmult1 */
x1212 + x1223 + x2212 + x3212 + x3223 <= 3; /* Mmult2 */
x1133 + x3133 <= 0; /* Mmult1 */
x1223 + x3223 <= 3; /* Mmult2 */
x4122 + x5122 <= 1; /* Malu1 */
x4222 + x5222 <= 1; /* Malu2 */
x4133 + x5133 + x6133 <= 1; /* Malu1 */
x4233 + x5233 + x6233 <= 1; /* Malu2 */
x5144 + x6144 <= 1; /* Malu1 */
x5244 + x6244 <= 1; /* Malu2 */
/* Precedence Constraints */
4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 3 x1133 - 3 x1223 - 2 x1122 - 2 x1212 - x1111 >= 1;
4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 3 x4133 - 3 x4233 - 2 x4122 - 2 x4222 >= 1;
3 x4133 + 3 x4233 + 2 x4122 + 2 x4222 - 2 x2122 - 2 x2212 - x2111 >= 1;
4 x5144 + 4 x5244 + 3 x5133 + 3 x5233 + 2 x5122 + 2 x5222 - 3 x3133 - 3 x3223 - 2 x3122 - 2 x3212 - x3111 >= 1;
/* Integer Constraints */
INT x1111, x1122, x1133, x1212, x1223, x2111, x2122, x2212, x3111, x3122, x3133, x3212, x3223, x4122, x4133, x4222, x4233, x5122, x5133, x5144, x5222, x5233, x5244, x6133, x6144, x6233, x6244;

Figure 4.5.
ILP F ormulation for Example DFG for Multiple Supply V oltages and Multic ycling 109 PAGE 127 4.4 Experimental Results W e tested the ILP scheduler with selected benchmark circuits, such as, (1) Example circuit, (2) FIR lter (3) IIR lter (4) HAL dif ferential equation solv er and (5) Auto re gressi v e lter The functional units (FUs) assumed are ALUs and MUL Ts. The datapath cells and their characterization are considered from [55 ]. The follo wing notations are used to e xpress results :3 ; ,; Oand; represent the total ener gy consumption (in) for single supply v oltage, MVDFC and MVMC operations respecti v ely .3 ;1Y% ,;^ % Oand;^ % are the ener gydelayprod uc ts (in4" @r d) for single supply v oltage and single frequenc y for multiple supply v oltage and single frequenc y and for multiple supply v oltage and dynamic clocking operations, respecti v ely .3Rhe percentage ener gy sa vings is calculated as,sV; O ut t Y4"D"ands; t v t 4"D".3The percentage EDP reductionsV;^ % Ois calculated as,sV;^ % O n t n n t 4"D"ands;^ % n t n v n t 4"D". The datapath scheduling algorithms were tested using the dif ferent sets of resource constraints listed belo w (RC1) multipliers (atA9andatZ Z 9) and ALUs (atA9andatZ Z 9) (RC2) multipliers (ZatA9) and ALUs (atA9andatZ Z 9) (RC3) multipliers (atA9) and ALUs (atZ Z 9) (RC4) multipliers (atA9) and ALUs (atZ Z 9) The e xperimental results for v arious benchmark circuits are reported in T able 4.3. Fig. 4.6 sho ws the results for the v arious benchmarks a v eraged o v er dif ferent resource constraints. The ener gy estimation includes the ener gy consumption of the o v erheads. The results reported are based on the assumption of tw o supply v oltages and switching acti vity of"#. The ener gy sa vings for the proposed algorithm is listed alongwith other multiple v oltage scheduling algorithms in T able 4.4. 110 PAGE 128 T able 4.3. 
Energy and EDP Estimates for Benchmarks for MVDFC and MVMC Schemes

                    Energy Estimates                         Energy Delay Products
Bench-   RC   Single   MVMC   MVDFC   % Red.  % Red.    Single    MVMC    MVDFC   % Red.  % Red.
mark                                  MVMC    MVDFC                               MVMC    MVDFC
(1) EXP  1    2955     2013   1572    31.9    46.8      985.0     894.7   873.3    9.2    11.3
         2    2955     1572   1572    46.8    46.8      985.0     698.7   698.7   29.1    29.1
         3    2955     1596   1596    46.0    46.0      985.0     886.7   798.0   10.0    19.0
         4    2955     1596   1596    46.0    46.0     1313.3     886.7   886.7   32.5    32.5
         Average Reduction            42.7    46.4                                20.2    23.0
(2) FIR  1    4900     3040   2587    38.0    47.2     2722.2    2026.7  2299.6   25.6    15.5
         2    4900     2587   2587    47.2    47.2     2722.2    1724.7  2012.1   36.6    26.1
         3    4900     2635   2635    46.2    46.2     2722.2    2049.4  2049.4   24.7    24.7
         4    4900     2635   2635    46.2    46.2     2722.2    2049.4  2049.4   24.7    24.7
         Average Reduction            44.4    46.7                                27.9    22.8
(3) IIR  1    4900     3958   3052    19.2    37.7     2177.8    2198.8  2373.8   NA      NA
         2    4900     2587   2549    47.2    47.0     2177.8    1724.7  2021.4   20.8     7.2
         3    4900     2635   2635    46.2    46.2     2722.2    2342.2  2049.4   14.0    24.7
         4    4900     2635   2635    46.2    46.2     2722.2    2342.2  2049.4   14.0    24.7
         Average Reduction            39.7    44.3                                12.2    18.9
(4) HAL  1    5885     4013   3119    31.8    47.0     2615.6    2675.3  2425.9   NA       7.3
         2    5885     3119   3107    47.0    47.2     2615.6    2079.3  2071.3   20.5    20.8
         3    5885     3167   3167    46.2    46.2     2615.6    2463.2  2287.3    5.8    12.5
         4    5885     3167   3167    46.2    46.2     3269.4    3319.3  2463.2   NA      24.7
         Average Reduction            42.8    46.6                                 6.6    16.3
(5) ARF  1    5000     2639   2639    47.2    47.2     5555.6    3811.8  4398.3   31.4    20.8
         2    5000     2639   2639    47.2    47.2     5555.6    3811.8  4398.3   31.4    20.8
         3    5000     2735   2735    45.3    45.3     5555.6    6839.4  3798.6   NA      31.6
         4    5000     2735   2735    45.3    45.3     5555.6    6839.4  3798.6   NA      31.6
         Average Reduction            46.3    46.3                                15.7    26.2
Overall Average Reduction             43.2    46.1                                16.5    21.4

[Four bar charts over the five benchmark circuits: Energy Reduction (Avg %) for MVDFC, EDP Reduction (Avg %) for MVDFC, Energy Reduction (Avg %) for MVMC, EDP Reduction (Avg %) for MVMC]

Figure 4.6.
Reduction for Different Benchmarks Expressed as Percentage in Average

From the table, we observe that both the energy and the energy delay product are reduced considerably for both the MVDFC and MVMC schemes. The MVDFC scheme results in better savings than the MVMC scheme for most of the cases, except the FIR benchmark. The energy savings of both schemes are the same for most cases, except for a few resource constraints. The savings would have been the same for both schemes had energy been used as the objective function, as the energy savings are due to the voltage reduction, not due to the dynamic frequency clocking or multicycling. However, use of energy as the objective function would have increased the energy delay product, thus reducing the performance.

Table 4.4. Savings for Various Scheduling Schemes
Columns (% average energy savings): Benchmark Circuits, This work (DFC, MC), Shiue [95], Sarrafzadeh [90], Johnson [65], Chang [51], Mohanty [55]
(2) fir: 47 44 23 53 46
(3) iir: 44 40 36
(4) hal: 47 43 24 36 40
(5) arf: 46 46 12 18 39 29 39

4.5 Conclusions

Our aim is to use frequency scaling concepts for energy-efficient high-performance ASIC design. The energy reduction is achieved through voltage reduction, and high performance through DFC. This chapter introduced an ILP based resource-constrained datapath scheduling algorithm using both multiple supply voltages and dynamic frequency clocking. It is observed that using two supply voltage levels, an average energy reduction of 46% and an average EDP reduction of 21% is obtained using MVDFC, whereas for the MVMC scheme an average energy reduction of 43% and an average EDP reduction of 17% is obtained. If the critical path contains a proportionate number of multipliers and ALUs, such that the net performance degradation due to the low frequency operation of multipliers can be overcome by the high frequency operation of ALUs, then the reduction was significant.
Incorporating such a scheduler into a low-power datapath synthesis tool will greatly benefit low power processor design, especially for compute intensive applications.

CHAPTER 5 PEAK POWER AND AVERAGE POWER MINIMIZATION

The use of multiple supply voltages for energy and average power reduction is well researched and several works have appeared in the literature. However, in low power design for deep submicron and nanometer regimes, the peak power, peak power differential, average power and total energy are equally critical design constraints. In this work, we propose datapath scheduling algorithms for simultaneous minimization of peak and average power [46]. The minimization schemes, based on integer linear programming (ILP), are developed for the design of datapaths that can function in three modes of operation: (1) single supply voltage and single frequency (SVSF), (2) multiple supply voltages and dynamic frequency clocking (MVDFC) and (3) multiple supply voltages and multicycling (MVMC). The use of dynamic frequency clocking is effective for power reduction in the design of data intensive signal processing applications. The effectiveness of our proposed technique is measured by estimating the peak power consumption, the average power consumption and the power delay product of the datapath circuits. Various experiments are conducted on selected high-level synthesis benchmark circuits under different resource constraints.

This chapter is organized as follows. The ILP formulations to minimize the peak and average power consumption are described first. The ILP-based scheduler is then introduced, followed by experimental results. We also investigated scheduling schemes that minimize only the peak power, without considering the average power; these are presented in the last section.
5.1 Peak and Average Power Consumption of a Datapath Circuit

In this section, we first mention the different notations and terminology needed for a scheduling model. Let us assume that the datapath is represented in the form of a sequencing data flow graph. The datapath uses various resources or functional units operating at different supply voltages. The level converters are considered as resource overheads, often needed when the voltage level needs to be stepped up in any control step. The dynamic clocking unit (DCU) that generates the different frequency levels is also accounted as a resource that will operate during all the control steps. The notation and terminology are given in Table 5.1. It may be noted that for the single frequency and single supply voltage mode of operation, $V_{i,c}$ and $f_c$ are the same for any clock cycle ($c$) and resource ($i$). Similarly, for multicycling operation $f_c$ is the same for any clock cycle ($c$).

Table 5.1. Notations used in Description
$c$ : any control step or clock cycle in the DFG
$N$ : total number of control steps in the DFG
$R_c$ : number of resources active in step $c$
$f_c$ : cycle frequency for control step $c$
$\alpha_{i,c}$ : switching activity of resource $i$ operating in step $c$
$C_{i,c}$ : load capacitance of resource $i$ operating in control step $c$
$V_{i,c}$ : operating voltage of resource $i$ operating in control step $c$
$P_c$ : power consumption for the DFG for any control step $c$
$P_{peak}$ : maximum power consumption for the DFG
$P_{avg}$ : average power consumption for the DFG
$T$ : critical path delay of the DFG
$PDP$ : power delay product of the DFG

The power consumption for any control step $c$ is

$$P_c = \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2}\, f_c \qquad (5.1)$$

The peak power consumption of the DFG is the maximum power consumption over all the control steps, which is expressed as below.

$$P_{peak} = \max\left(P_c\right)_{c = 1, 2, \ldots, N} \qquad (5.2)$$

We rewrite Eqn. 5.2 using Eqn. 5.1 as follows.

$$P_{peak} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2}\, f_c\right)_{c = 1, 2, \ldots, N} \qquad (5.3)$$

The average power consumption of the DFG is characterised as the mean of the cycle powers ($P_c$) for all control steps.

$$P_{avg} = \frac{1}{N} \sum_{c=1}^{N} P_c \qquad (5.4)$$

Again using Eqn. 5.1, we rewrite Eqn. 5.4 as follows.

$$P_{avg} = \frac{1}{N} \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2}\, f_c \qquad (5.5)$$

Since the aim is the simultaneous reduction of both peak and average power, the objective function to be minimized by the scheduling algorithm is the sum of Eqns. 5.3 and 5.5. The critical path delay of the DFG can be calculated as

$$T = \sum_{c=1}^{N} \frac{1}{f_c} \qquad (5.6)$$

It should be noted that $f_c$ is the same for all values of $c$ in single frequency and multicycling operations, and may be different for dynamic frequency clocking operations. The power delay product of the DFG is defined as the product of the average power consumption and the critical path delay, as shown below.

$$PDP = P_{avg} \times T \qquad (5.7)$$

Using Eqns. 5.4 and 5.6, the following expression for the power delay product is obtained.

$$PDP = \frac{1}{N} \sum_{c=1}^{N} P_c \times \sum_{c=1}^{N} \frac{1}{f_c} \qquad (5.8)$$

Similarly, the following expression for the power delay product is arrived at using Eqns. 5.5 and 5.6.

$$PDP = \frac{1}{N} \sum_{c=1}^{N} \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^{2}\, f_c \times \sum_{c=1}^{N} \frac{1}{f_c} \qquad (5.9)$$

To study the impact of the scheduling algorithms on the performance of the datapath, the power delay product of the scheduled DFGs will be estimated using the above expression.

5.2 ILP Formulations

In this section, we discuss the ILP formulations to minimize the peak and average power consumption of a datapath circuit. We first discuss the formulations for a system based on multiple supply voltages and dynamic clocking, followed by a system based on multiple supply voltages and multicycling.

5.2.1 ILP Formulations for DFC

In this section, the ILP formulation for simultaneous peak (Eqn. 5.3) and average power (Eqn. 5.5) minimization using multiple supply voltages and dynamic frequency clocking (DFC) is described.
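Before turning to the formulation, the quantities of Section 5.1 that it minimizes (per-step power, peak power, average power, critical path delay and power delay product) can be made concrete with a small sketch. The function names and the numeric values below are hypothetical and chosen only for illustration; the schedule is assumed to be given as one cycle frequency per control step together with the (switching activity, load capacitance, voltage) triple of every resource active in that step.

```python
def step_power(resources, frequency):
    """Power of one control step: sum of alpha * C * V^2 over active resources, times f_c."""
    return sum(alpha * cap * volt ** 2 for alpha, cap, volt in resources) * frequency


def power_metrics(schedule):
    """Return (peak power, average power, critical path delay, PDP) of a schedule.

    schedule: list with one (f_c, [(alpha, C, V), ...]) entry per control step.
    """
    step_powers = [step_power(resources, f) for f, resources in schedule]
    peak = max(step_powers)                      # maximum over all control steps
    average = sum(step_powers) / len(step_powers)  # mean of the cycle powers
    delay = sum(1.0 / f for f, _ in schedule)    # sum of the cycle periods
    return peak, average, delay, average * delay


# Hypothetical two-step schedule in arbitrary units.
example = [
    (2.0, [(0.5, 1.0, 2.0)]),                   # one resource, higher frequency
    (1.0, [(0.5, 1.0, 2.0), (0.5, 2.0, 1.0)]),  # two resources, lower frequency
]
peak, average, delay, pdp = power_metrics(example)
```

With dynamic frequency clocking each step carries its own frequency, as in the example; for single-frequency or multicycling operation the same frequency would simply be repeated for every step.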
In dynamic frequenc y clocking [62 63 ], the clock frequenc y is v aried onthey based on the functional units acti v e in that c ycle. In this clocking scheme, all the units are clock ed by a single clock line which switches at runtime. The frequenc y reduction creates an opportunity to operate the dif ferent functional units at dif ferent v oltages, which in turn, helps in further reduction of po wer The notations used for ILP formulations are gi v en in T able 5.2. T able 5.2. Notations used in ILP F ormulations : total number of operations in the DFG e xcluding the source and sink nodes m: an y operation,' & : functional unit of typeoperating at v oltage le v el> : maximum number of functional units of typeoperating at v oltage le v el> m: as soon as possible (ASAP) time stamp for the operationm ; m: as late as possible (ALAP) time stamp for the operation m %. :B>v: r 0: po wer consumption of operationmat v oltage le v el>and operating frequenc yr mb : decision v ariable which tak es the v alue ofif operationmis scheduled in control step¤using the functional unit& and¤has frequenc yr 8 mb C l: decision v ariable which tak es the v alue ofifmis using the functional unit& and scheduled in control steps mb : latenc y for operationDmusing resource operating at v oltage> (in terms of number of clock c ycles) Objective Function : The objecti v e is to minimize the peak po wer and the a v erage po wer consumption of the whole DFG o v er all control steps simultaneously These are already described abo v e in 117 PAGE 135 Eqn. 5.3 and 5.5.* EL L %vrO % k(5.10) Using decision v ariables the objecti v e function can be re written as follo ws :* UL L %vrO @ i m),D r ~ mb T%. :B>: r 0(5.11) It should be noted that the% ris unkno wn and has to be minimized. It may be po wer consumption of an y control step in the DFG depending on the scheduled operations and hence is later used as a constraint. 
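The role of the unknown peak power in the objective can be illustrated by brute force on a toy instance. This is a sketch under strong assumptions, not the ILP itself: there are no precedence, resource or frequency constraints, the clock frequency is fixed so each operation has a single power value, and the operation names and powers are hypothetical. It only shows that the peak term is what separates schedules whose average power is identical.

```python
from itertools import product


def objective(assignment, op_power, num_steps):
    """Peak power plus average power for one assignment of operations to steps."""
    step_power = [0.0] * num_steps
    for op, step in assignment.items():
        step_power[step] += op_power[op]
    return max(step_power) + sum(step_power) / num_steps


def best_schedule(op_power, num_steps):
    """Enumerate every assignment and keep the first one with the lowest objective."""
    ops = sorted(op_power)
    best = None
    for steps in product(range(num_steps), repeat=len(ops)):
        assignment = dict(zip(ops, steps))
        cost = objective(assignment, op_power, num_steps)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Two independent hypothetical operations and two control steps.
cost, assignment = best_schedule({"a": 3.0, "b": 2.0}, num_steps=2)
```

An ILP solver cannot take a maximum directly, which is why the formulation treats the peak power as an extra variable in the objective and bounds the power of every control step by it in a separate set of constraints.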
Uniqueness Constr aints : These constraints ensure that each operationAmis scheduled to one unique control step within the mobility range ( m,; m) with a particular supply v oltage and operating frequenc y The y are represented as, ,' ?, mb (5.12) Pr ecedence Constr aints : These constraints ascertain that for an operation#m, all its predecessors are scheduled in an earlier control step and its successors are scheduled in an later control step. These are modelled as, :0D: PAGE 136 at lo wer v oltage le v el then it can not be scheduled in a higher frequenc y control step. These constraints are written as, ,' ,u¤,[¤ p, ifr >, then m ". P eak P ower Constr aints : These constraints mak e certain that the maximum po wer consumption of the DFG does not e xceed% rfor an y control step. These constraints are applied as follo ws,u¤,'¤n pandv>, m),D r ~ mb T%. :B>v: r 0 % r(5.15) 5.2.2 ILP F ormulations f or Multicycling In this section, the ILP formulations for simultaneous minimization of both peak and a v erage po wer consumption of the DFG using multiple supply v oltages and multic ycling will be discussed. Objective Function : The objecti v e is to minimize the peak and a v erage po wer consumption of the whole DFG o v er all control steps. The e xpressions gi v en in Eqn. 5.3 and Eqn. 5.5 are still v alid here, with only dif ference being thatr is the same for all control steps.* EL L %vrO % k(5.16) In terms of decision v ariables, the abo v e is written as :* UL L % r O @ i C m,D r ~ 8 mb C C n r ~ @ T%. :B>: r C 0(5.17) The% ris used as a constraint later Uniqueness Constr aints : These constraints conrm that e v ery operation#mis scheduled in appropriate control steps within the mobility range ( m,; m) with a particular supply v oltage. It may be operated at more than one clock c ycle depending on the supply v oltage. 
These constraints are 119 PAGE 137 represented as, ,^ , @ n r ~ C 8 mb C C n r ~ @ (5.18) When the operators are operating at highest v oltage, the y are scheduled in one unique control step, whereas, when the y are to be operated at lo wer v oltages the y need more than one clock c ycle for completion. Thus, for lo wer v oltage the mobility is restricted. Pr ecedence Constr aints : These constraints guarantee that for an operation m, all its predecessors are scheduled in an earlier control step and its successors are scheduled in an later control step. These constraints should also tak e care of the multic ycling operations. These are modeled as, :0: PAGE 138 used in [55 ]. In this architecture, le v el con v erters are used when a lo wv oltage functional unit dri v es a highv oltage functional unit [65 ]. Peak po wer consumption of the DFG is minimized by the ILP based scheduler outlined in Fig. 5.1. The rst step is to determine the as soon as possible (ASAP) time stamp of each operation. The second step is the determination of the as late as possible (ALAP) time stamp of each v erte x for the DFG. The ASAP time stamp is the start time and the ALAP time stamp is the nish time of each operation. These tw o times pro vide the mobility of an operation and the operation must be scheduled in this mobile range. This mobility graph needs to be modied for the multic ycling scheme. The scheduler is based on the ILP formulations described in Section 5.2. At this point, the operating frequenc y of a functional unit is assumed as the in v erse of its operational delay determined using the delay model gi v en in [48 ]. The ILP formulations are solv ed to deri v e the scheduled DFG. The scheduler decides the c ycle frequencies based on the formulas gi v en in [48 ]. Finally the po wer consumption of the scheduled DFG is estimated. Step 1 : Find ASAP schedule of the UDFG. Step 2 : Find ALAP schedule of the UDFG. Step 3 : Determine the mobility graph of each node. 
Step 4 : Modify the mobility graph for multicycling.
Step 5 : Construct the ILP formulations.
Step 6 : Solve the ILP formulations using LPSolve.
Step 7 : Find the scheduled DFG.
Step 8 : Determine the cycle frequencies for DFC scheme.
Step 9 : Estimate the power consumptions of the DFG.
Figure 5.1. ILP-Based Scheduler

5.3.1 Scheduler using Multiple Voltages and Dynamic Frequency Clocking

The intermediate steps in the solution of the ILP formulations for multiple supply voltages and dynamic frequency clocking are illustrated using the DFG shown in Fig. 5.2. The ASAP schedule is shown in Fig. 5.2(a) and the ALAP schedule is shown in Fig. 5.2(b). From the ASAP and ALAP schedules, the mobility graph shown in Fig. 5.2(c) is determined. We have shown one such ILP formulation in Fig. 5.3 for the resource constraint (RC3): two multipliers at 2.4 V and two ALUs operating at 3.3 V, using switching activity of "#. In Fig. 5.3, we used the following additional notations: (i) PP: peak power, (ii) Mmult1: number of multipliers at voltage level 1, (iii) Mmult2: number of multipliers at voltage level 2, (iv) Malu1: number of ALUs at voltage level 1, and (v) Malu2: number of ALUs at voltage level 2. The ILP formulations are solved using LPSolve and the scheduled DFG is shown in Fig. 5.2(d).

Figure 5.2. Example DFG for Resource Constraint RC3; using Multiple Supply Voltages and Dynamic Frequency Clocking: (a) ASAP Schedule, (b) ALAP Schedule, (c) Mobility Graph, (d) Final Schedule

/* ILP Formulation for Simultaneous Peak and Average Power Minimization for MVDFC scheme */
/* Objective function */
min: 2.89 x1111 + 1.44 x1112 + 1.52 x1121 + 0.76 x1122
   + 2.89 x2111 + 1.44 x2112 + 1.52 x2121 + 0.76 x2122
   + 2.89 x3111 + 1.44 x3112 + 1.52 x3121 + 0.76 x3122
   + 2.89 x1211 + 1.44 x1212 + 1.52 x1221 + 0.76 x1222
   + 2.89 x3211 + 1.44 x3212 + 1.52 x3221 + 0.76 x3222
   + 0.08 x4211 + 0.04 x4212 + 0.04 x4221 + 0.02 x4222
   + 0.08 x5211 + 0.04 x5212 + 0.04 x5221 + 0.02 x5222
   + 0.08 x5311 + 0.04 x5312 + 0.04 x5321 + 0.02 x5322
   + 0.08 x6311 + 0.04 x6312 + 0.04 x6321 + 0.02 x6322 + PP;
/* Uniqueness Constraints */
x1111 + x1112 + x1121 + x1122 + x1211 + x1212 + x1221 + x1222 = 1;
x2111 + x2112 + x2121 + x2122 = 1;
x3111 + x3112 + x3121 + x3122 + x3211 + x3212 + x3221 + x3222 = 1;
x4211 + x4212 + x4221 + x4222 = 1;
x5211 + x5212 + x5221 + x5222 + x5311 + x5312 + x5321 + x5322 = 1;
x6311 + x6312 + x6321 + x6322 = 1;
/* Precedence Constraints */
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - 2 x1211 - 2 x1212 - 2 x1221 - 2 x1222 - x1111 - x1112 - x1121 - x1122 >= 1;
2 x4211 + 2 x4212 + 2 x4221 + 2 x4222 - x2111 - x2112 - x2121 - x2122 >= 1;
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - 2 x4211 - 2 x4212 - 2 x4221 - 2 x4222 >= 1;
3 x5311 + 3 x5312 + 3 x5321 + 3 x5322 + 2 x5211 + 2 x5212 + 2 x5221 + 2 x5222 - 2 x3211 - 2 x3212 - 2 x3221 - 2 x3222 - x3111 - x3112 - x3121 - x3122 >= 1;
/* Resource Constraints */
x1111 + x2111 + x3111 + x1112 + x2112 + x3112 <= 0; /* Mmult1 */
x1121 + x2121 + x3121 + x1122 + x2122 + x3122 <= 2; /* Mmult2 */
x1211 + x3211 + x1212 + x3212 <= 0; /* Mmult1 */
x1221 + x3221 + x1222 + x3222 <= 2; /* Mmult2 */
x4211 + x5211 + x4212 + x5212 <= 2; /* Malu1 */
x4221 + x5221 + x4222 + x5222 <= 0; /* Malu2 */
x5311 + x6311 + x5312 + x6312 <= 2; /* Malu1 */
x5321 + x6321 + x5322 + x6322 <= 0; /* Malu2 */
/* Frequency Constraints */
x1121 = 0; x1221 = 0; x2121 = 0; x3121 = 0; x3221 = 0; x4221 = 0; x5221 = 0; x5321 = 0; x6321 = 0;
/* Peak Power Constraints */
8.64 x1111 + 4.32 x1112 + 4.56 x1121 + 2.28 x1122 + 8.64 x2111 + 4.32 x2112 + 4.56 x2121 + 2.28 x2122 + 8.64 x3111 + 4.32 x3112 + 4.56 x3121 + 2.28 x3122 <= PP;
8.64 x1211 + 4.32 x1212 + 4.56 x1221 + 2.28 x1222 + 8.64 x3211 + 4.32 x3212 + 4.56 x3221 + 2.28 x3222 + 0.23 x4211 + 0.11 x4212 + 0.12 x4221 + 0.06 x4222 + 0.23 x5211 + 0.11 x5212 + 0.12 x5221 + 0.06 x5222 <= PP;
0.23 x5311 + 0.11 x5312 + 0.12 x5321 + 0.06 x5322 + 0.23 x6311 + 0.11 x6312 + 0.12 x6321 + 0.06 x6322 <= PP;
/* Integer Constraints */
INT x1111, x1112, x1121, x1122, x1211, x1212, x1221, x1222, x2111, x2112, x2121, x2122, x3111, x3112, x3121, x3122, x3211, x3212, x3221, x3222, x4211, x4212, x4221, x4222, x5211, x5212, x5221, x5222, x5311, x5312, x5321, x5322, x6311, x6312, x6321, x6322;

Figure 5.3. ILP Formulation for Example DFG using DFC, for RC3 and Switching Activity ="#

Figure 5.4. Example DFG for Resource Constraint RC3; using Multiple Supply Voltages and Multicycling: (a) Mobility Graph, (b) Final Schedule

5.3.2 Scheduler using Multiple Supply Voltages and Multicycling

The solution of the ILP formulation for multiple supply voltages and multicycling is illustrated using the DFG shown in Fig. 5.4. The ASAP schedule is shown in Fig. 5.2(a) and the ALAP schedule is shown in Fig. 5.2(b). From the ASAP and ALAP schedules, the mobility graph shown in Fig. 5.4(a) is obtained. This mobility graph is different from that shown in Fig. 5.2(c); the mobility graph in Fig.
5.4(a) considers the multicycle operations. Two operating voltage levels are assumed in Fig. 5.4(a). The multipliers take two clock cycles when operated at the low voltage level. For the characterised cells used in our experiment [55], the operating clock frequency is SD*+. The ILP formulations are obtained using this mobility graph. We have shown one such ILP formulation in Fig. 5.5 for the resource constraint (RC3): two multipliers at 2.4 V, two ALUs at 3.3 V, and switching activity "#. In Fig. 5.5, the notations PP, Mmult1, Mmult2, Malu1 and Malu2 have the same meaning as in the DFC case shown in Fig. 5.3. The ILP formulations are solved using LPSolve and the scheduled DFG is shown in Fig. 5.4(b).

/* ILP Formulation for Simultaneous Peak and Average Power Minimization for MVMC scheme */
/* Objective function */
min: 1.7 x1111 + 0.9 x1212 + 1.7 x2111 + 0.9 x2212 + 1.7 x3111 + 0.9 x3212
   + 1.7 x1122 + 0.9 x1212 + 0.9 x1223 + 1.7 x2122 + 0.9 x2212 + 0.9 x2223 + 1.7 x3122 + 0.9 x3212 + 0.9 x3223 + 0.05 x4122 + 0.02 x4222 + 0.05 x5122 + 0.02 x5222
   + 1.7 x1133 + 0.9 x1223 + 0.9 x1234 + 1.7 x2133 + 0.9 x2223 + 1.7 x3133 + 0.9 x3223 + 0.9 x3234 + 0.05 x4133 + 0.02 x4233 + 0.05 x5133 + 0.02 x5233 + 0.05 x6133 + 0.02 x6233
   + 1.7 x1144 + 0.9 x1234 + 1.7 x3144 + 0.9 x3234 + 0.05 x4144 + 0.02 x4244 + 0.05 x5144 + 0.02 x5244 + 0.05 x6144 + 0.02 x6244
   + 0.05 x5155 + 0.02 x5255 + 0.05 x6155 + 0.02 x6255 + PP;
/* Uniqueness Constraints */
x1111 + x1122 + x1133 + x1144 + x1212 + x1223 + x1234 = 1;
x2111 + x2122 + x2133 + x2212 + x2223 = 1;
x3111 + x3122 + x3133 + x3144 + x3212 + x3223 + x3234 = 1;
x4122 + x4133 + x4144 + x4222 + x4233 + x4244 = 1;
x5122 + x5133 + x5144 + x5155 + x5222 + x5233 + x5244 + x5255 = 1;
x6133 + x6144 + x6155 + x6233 + x6244 + x6255 = 1;
/* Peak Power Constraints */
8.6 x1111 + 4.6 x1212 + 8.6 x2111 + 4.6 x2212 + 8.6 x3111 + 4.6 x3212 <= PP;
8.6 x1122 + 4.6 x1212 + 4.6 x1223 + 8.6 x2122 + 4.6 x2212 + 4.6 x2223 + 8.6 x3122 + 4.6 x3212 + 4.6 x3223 + 0.2 x4122 + 0.1 x4222 + 0.2 x5122 + 0.1 x5222 <= PP;
8.6 x1133 + 4.6 x1223 + 4.6 x1234 + 8.6 x2133 + 4.6 x2223 + 8.6 x3133 + 4.6 x3223 + 4.6 x3234 + 0.2 x4133 + 0.1 x4233 + 0.2 x5133 + 0.1 x5233 + 0.2 x6133 + 0.1 x6233 <= PP;
8.6 x1144 + 4.6 x1234 + 8.6 x3144 + 4.6 x3234 + 0.2 x4144 + 0.1 x4244 + 0.2 x5144 + 0.1 x5244 + 0.2 x6144 + 0.1 x6244 <= PP;
0.2 x5155 + 0.1 x5255 + 0.2 x6155 + 0.1 x6255 <= PP;
/* Resource Constraints */
x1111 + x2111 + x3111 <= 0; /* Mmult1 */
x1212 + x2212 + x3212 <= 2; /* Mmult2 */
x1122 + x2122 + x3122 <= 0; /* Mmult1 */
x1212 + x1223 + x2212 + x2223 + x3212 + x3223 <= 2; /* Mmult2 */
x1133 + x2133 + x3133 <= 0; /* Mmult1 */
x1223 + x1234 + x2223 + x3223 + x3234 <= 2; /* Mmult2 */
x1144 + x3144 <= 0; /* Mmult1 */
x1234 + x3234 <= 2; /* Mmult2 */
x4122 + x5122 <= 2; /* Malu1 */
x4222 + x5222 <= 0; /* Malu2 */
x4133 + x5133 + x6133 <= 2; /* Malu1 */
x4233 + x5233 + x6233 <= 0; /* Malu2 */
x4144 + x5144 + x6144 <= 2; /* Malu1 */
x4244 + x5244 + x6244 <= 0; /* Malu2 */
x5155 + x6155 <= 2; /* Malu1 */
x5255 + x6255 <= 0; /* Malu2 */
/* Precedence Constraints */
5 x6155 + 5 x6255 + 4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 4 x1144 - 4 x1234 - 3 x1133 - 3 x1223 - 2 x1122 - 2 x1212 - x1111 >= 1;
5 x6155 + 5 x6255 + 4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 4 x4144 - 4 x4244 - 3 x4133 - 3 x4233 - 2 x4122 - 2 x4222 >= 1;
4 x4144 + 4 x4244 + 3 x4133 + 3 x4233 + 2 x4122 + 2 x4222 - 3 x2133 - 3 x2223 - 2 x2122 - 2 x2212 - x2111 >= 1;
5 x5155 + 5 x5255 + 4 x5144 + 4 x5244 + 3 x5133 + 3 x5233 + 2 x5122 + 2 x5222 - 4 x3144 - 4 x3234 - 3 x3133 - 3 x3223 - 2 x3122 - 2 x3212 - x3111 >= 1;
/* Integer Constraints */
INT x1111, x1122, x1133, x1144, x1212, x1223, x1234, x2111, x2122, x2133, x2212, x2223, x3111, x3122, x3133, x3144, x3212, x3223, x3234, x4122, x4133, x4144, x4222, x4233, x4244, x5122, x5133, x5144, x5155, x5222, x5233, x5244, x5255, x6133, x6144, x6155, x6233, x6244, x6255;

Figure 5.5. ILP Formulation for Example DFG using Multicycling, for RC3 and Switching Activity ="#
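The way the multicycling mobility graph turns into decision variables can be sketched as follows. This is an illustration under stated assumptions rather than the tool's code: the helper names are hypothetical, the variable naming (operation, voltage level, start step, end step, one digit each) and the mobility windows are read off the uniqueness constraints of Fig. 5.5, and a multiplier is assumed to take one cycle at voltage level 1 and two cycles at voltage level 2, while an ALU takes one cycle at either level.

```python
# Assumed latency in clock cycles per unit type and voltage level.
LATENCY = {"mult": {1: 1, 2: 2}, "alu": {1: 1, 2: 1}}


def mvmc_variables(op, kind, first, last):
    """List the decision variables of one operation whose window is steps first..last."""
    names = []
    for level, span in sorted(LATENCY[kind].items()):
        # The operation must finish inside its window, so a multicycle
        # choice has fewer admissible start steps.
        for start in range(first, last - span + 2):
            names.append(f"x{op}{level}{start}{start + span - 1}")
    return names


def uniqueness_constraint(op, kind, first, last):
    """Exactly one (voltage level, start step) choice per operation, in LP text form."""
    return " + ".join(mvmc_variables(op, kind, first, last)) + " = 1;"


mult_row = uniqueness_constraint(1, "mult", 1, 4)
alu_row = uniqueness_constraint(5, "alu", 2, 5)
```

The shrinking range of start steps for the two-cycle choice is the restriction of mobility at the lower voltage that the multicycling formulation relies on.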
5.4 Experimental Results

The ILP-based schedulers for both the multiple supply voltages and dynamic frequency clocking scheme and the multiple supply voltages and multicycling scheme were tested with five high-level synthesis benchmark circuits: (1) Example circuit (EXP), (2) FIR filter, (3) IIR filter, (4) HAL differential equation solver and (5) Auto-Regressive filter (ARF). The notations used to express the various results are given in Table 5.3. The schedulers were tested using the different sets of resource constraints (RC1, RC2, RC3, RC4, RC5) shown in Table 5.4 for each benchmark circuit. The experimental results for the various benchmark circuits are reported in Table 5.5 for both the dynamic frequency clocking and multicycling schemes. The power is estimated including the overheads, such as level converters (used in both schemes) and dynamic clocking units (needed for the dynamic frequency clocking case). It is assumed that each resource has equal switching activity ($\alpha_{i,c}$). The results are reported for two supply voltages and for switching activity "#.

To give a visual picture of the experimental results, we plotted the peak power reductions, average power reductions and PDP reductions averaged over the different sets of resource constraints. Fig. 5.6 shows the average reductions for the different benchmarks, averaged over all resource constraints. It is evident from the figure that the reductions using combined multiple supply voltages and dynamic frequency clocking are appreciable. It is observed that the power consumption increases for higher switching activity and decreases for lower switching activity. The power reductions for the proposed scheduling scheme are listed along with other scheduling algorithms dealing with peak power reduction in Table 5.6. The table is not meant to provide an exact comparison, but to give a general idea of relative performance.

Table 5.3.
Notations used in Expressing Results

$P_{pS}$ : the peak power consumption for single supply voltage and single frequency operation
$P_{pD}$ : the peak power consumption for multiple supply voltages and dynamic frequency operation
$P_{pM}$ : the peak power consumption for multiple supply voltages and multicycle operation
$P_S$ : the average power consumption for single supply voltage and single frequency operation
$P_D$ : the average power consumption for multiple supply voltages and dynamic frequency operation
$P_M$ : the average power consumption for multiple supply voltages and multicycle operation
$T_S$ : the critical path delay for single supply voltage and single frequency operation
$T_D$ : the critical path delay for multiple supply voltages and dynamic frequency operation
$T_M$ : the critical path delay for multiple supply voltages and multicycle operation
$PDP_S$ : the power delay product for single supply voltage and single frequency operation, $(P_S \times T_S)$
$PDP_D$ : the power delay product for multiple supply voltage and dynamic frequency clocking operation, $(P_D \times T_D)$
$PDP_M$ : the power delay product for multiple supply voltage and multicycle operation, $(P_M \times T_M)$
$\Delta P_{pD}$ : the percentage peak power reduction using the multiple supply voltages and dynamic frequency scheme, $\frac{P_{pS} - P_{pD}}{P_{pS}} \times 100$
$\Delta P_{pM}$ : the percentage peak power reduction using the multiple supply voltages and multicycle scheme, $\frac{P_{pS} - P_{pM}}{P_{pS}} \times 100$
$\Delta PDP_D$ : the percentage PDP reduction using the multiple supply voltages and dynamic frequency scheme, $\frac{PDP_S - PDP_D}{PDP_S} \times 100$
$\Delta PDP_M$ : the percentage PDP reduction using the multiple supply voltages and multicycle scheme, $\frac{PDP_S - PDP_M}{PDP_S} \times 100$

Table 5.4.
Resource Constraints used for our Experiment

Multipliers        ALUs               Resource Constraint
2.4 V    3.3 V     2.4 V    3.3 V     Label
2        1         1        1         RC1
3        0         1        1         RC2
2        0         0        2         RC3
1        1         0        1         RC4
2        0         0        1         RC5

5.5 Peak Power Minimization

In the previous few sections we presented the formulations for simultaneous minimization of peak and average power of a datapath circuit. In this section we discuss the ILP-based scheduling scheme that minimizes peak power only, without explicitly considering the average power [45, 165]. The peak power consumption presented in Eqn. 5.2 serves as the objective function. The equation is reproduced here for quick reference; the notations have the same meaning as before.

$$P_{peak} = \max\left(P_c\right)_{c=1,2,\ldots,N} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c\right)_{c=1,2,\ldots,N} \qquad (5.22)$$

The above equation can be rewritten as follows for the multiple supply voltages and multicycling scenario, where the clock frequency is the same for all control steps and is denoted as $f_{clk}$.

$$P_{peak} = \max\left(P_c\right)_{c=1,2,\ldots,N} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_{clk}\right)_{c=1,2,\ldots,N} \qquad (5.23)$$

5.5.1 ILP Formulations

In this section, we formulate the ILP models for peak power minimization for both the MVDFC and the MVMC scenarios. The ILP models ensure that the dependency constraints and resource constraints are satisfied. A level converter is considered as a resource operating in the control step in which it needs to step up a signal. The dynamic clocking unit (DCU) that generates the dynamic frequency is considered as a resource operating in all the control steps. The power dissipation of the level converters and the DCU is included. In order to formulate an ILP-based model for Eqn. 5.22, and hence a scheduling scheme for the DFG, we use the same notations given in Table 5.2.

Table 5.5. Peak Power, Average Power and PDP Estimates for Benchmarks using Scheduling Schemes

(S: single supply voltage and single frequency; DFC: multiple supply voltages with dynamic frequency clocking; MC: multiple supply voltages with multicycling; Red.%: percentage reduction relative to S.)

                 Peak Power                          Average Power                       PDP Estimates
Benchmark  RC    S      DFC    Red.%  MC     Red.%   S      DFC    Red.%  MC     Red.%   S      DFC    Red.%  MC     Red.%
(1) exp    1     17.28  4.56   73.6   8.76   49.3    8.86   2.41   72.8   6.57   25.8    2.95   1.33   54.9   2.92   0
           2     17.28  4.56   73.6   13.68  20.8    8.86   2.41   72.8   6.98   21.2    2.95   1.33   54.9   3.1    0
           3     17.28  4.56   73.6   9.12   47.2    8.86   2.61   70.5   5.58   37.0    2.95   1.30   55.9   3.1    0
           4     8.86   2.39   73.0   8.86   0       6.65   1.88   71.7   6.65   0       2.96   1.36   54.1   2.95   0
           Average values      73.5          29.3                  72.0          21.0                  55.0          0
(2) fir    1     17.28  4.56   73.6   8.76   49.3    8.82   2.34   73.5   7.28   17.5    4.9    2.34   52.5   4.85   0
           2     17.28  4.56   73.6   13.68  20.8    8.82   2.35   73.4   7.68   12.9    4.9    2.35   52.0   5.12   0
           3     17.28  4.56   73.6   13.68  20.8    8.82   2.44   72.3   6.64   24.7    4.9    2.30   53.0   5.12   0
           4     17.28  6.60   61.8   8.86   48.7    8.82   2.84   67.8   7.35   16.7    4.9    2.68   45.3   4.9    0
           Average values      70.7          34.9                  71.8          18.0                  50.7          0
(3) iir    1     25.92  8.88   65.7   17.76  31.5    11.03  3.49   68.4   8.95   18.9    4.9    2.32   52.7   4.97   0
           2     25.92  6.84   73.6   13.68  47.2    11.03  2.98   73.0   7.68   30.4    4.9    1.98   59.6   5.12   0
           3     17.28  4.56   73.6   9.12   47.2    8.82   2.45   72.2   5.24   40.6    4.9    2.0    59.2   4.66   4.9
           4     17.28  6.60   61.8   13.20  23.6    8.82   3.31   62.5   8.05   8.7     4.9    2.57   47.6   5.37   0
           Average values      68.7          37.4                  69.0          24.7                  54.8          1.0
(4) hal    1     17.51  4.62   74.7   13.32  23.9    13.25  3.55   73.2   8.82   33.4    5.89   2.76   53.1   5.88   0.2
           2     17.51  4.62   74.7   13.68  21.9    13.25  3.55   73.2   9.23   30.3    5.89   2.76   53.1   6.15   0
           3     17.51  4.67   73.3   9.34   46.7    13.25  3.73   71.8   7.98   39.8    5.89   2.69   54.3   6.20   0
           4     17.51  6.71   61.7   13.42  23.4    10.59  3.73   64.8   8.90   16.0    5.88   3.52   40.1   5.93   0
           Average values      71.1          29.0                  70.8          29.9                  50.2          0.7
(5) arf    1     8.86   2.34   73.6   8.64   2.5     4.50   1.20   73.3   3.40   24.4    5.00   2.00   60.0   4.85   3.0
           2     8.86   2.34   73.6   8.64   2.5     4.50   1.20   73.3   3.58   24.4    5.00   2.00   60.0   4.85   3.0
           3     8.86   2.39   73.0   8.76   1.1     4.50   1.40   68.9   3.65   18.9    5.00   1.90   62.0   5.0    0
           4     8.86   2.39   73.0   8.76   1.1     4.50   1.40   68.9   3.46   23.1    5.00   1.90   62.0   5.0    0
           Average values      73.3          1.8                   71.1          22.7                  61.0          1.1
Average over all benchmarks    71.5          26.5                  71.0          23.3                  54.3          0.5

Figure 5.6. Average Reduction for Different Benchmarks: (a) peak power reduction using DFC scheme, (b) peak power reduction using multicycling, (c) average power reduction using DFC scheme, (d) average power reduction using multicycling. [Bar charts; horizontal axis: benchmark circuits 1 to 5; vertical axis: reduction (%).]

Table 5.6. Peak and Average Power Reduction for Various Scheduling Schemes
(Percentage average data for various schemes; "peak" is the peak power reduction and "avg." the average power reduction.)

Benchmark   DFC based     Shiue [119]   Martin [44]   Raghunathan [47]   Mohanty [48]
Circuits    peak  avg.    peak  avg.    peak  avg.    peak  avg.         peak  avg.
EXP (1)     73    72      -     -       -     -       -     -            -     -
FIR (2)     71    72      63    NA      40    NO      23    38           71    53
IIR (3)     69    69      -     -       -     -       -     -            -     -
HAL (4)     71    71      28    NA      -     -       -     -            73    70
ARF (5)     73    71      50    NA      -     -       -     -            68    67

5.5.1.1 Multiple Supply Voltages and Dynamic Frequency Clocking (MVDFC)

In this subsection, we describe the ILP formulation for peak power minimization using multiple supply voltages and dynamic frequency clocking. In dynamic frequency clocking, the clock frequency is varied on-the-fly based on the functional units active in that cycle. In this clocking scheme, all the units are clocked by a single clock line which switches at runtime.
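Because the clock frequency may differ from one control step to the next under dynamic frequency clocking, the per-cycle power of Eqn. 5.22 has to be evaluated with the frequency of that particular step. The following is a minimal sketch of that evaluation; the function names, the tuple layout and the numeric values are illustrative assumptions, not characterised data from the dissertation.

```python
def cycle_power(units, freq_hz):
    """Power of one control step: f_c * sum(alpha * C * V^2) over active units.

    units   -- iterable of (alpha, capacitance_farad, voltage_volt) tuples
    freq_hz -- clock frequency used in this control step
    """
    return freq_hz * sum(alpha * cap * volt * volt for alpha, cap, volt in units)


def peak_power(steps):
    """Peak power of a schedule given as a list of (units, freq_hz) per step."""
    return max(cycle_power(units, freq) for units, freq in steps)


# Hypothetical two-step schedule: one unit in step 1, two units in step 2,
# with step 2 clocked at half the frequency of step 1.
unit = (0.5, 1e-12, 2.0)
schedule = [([unit], 1e6), ([unit, unit], 0.5e6)]
step_powers = [cycle_power(units, freq) for units, freq in schedule]
```

In this made-up schedule the second step has twice the switched capacitance but half the clock frequency, so both steps dissipate the same power and the peak is not raised by the busier step.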
The frequency reduction creates an opportunity to operate the different functional units at different voltages, which in turn helps in further reduction of power. In this case the objective is to minimize the peak power consumption of the whole DFG over all control steps, described in Eqn. 5.22, without explicitly considering the average power minimization. Thus the objective function changes into the equation given below.

$$\text{Minimize}: \quad P_{peak} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c\right)_{c=1,2,\ldots,N} \qquad (5.24)$$

It should be noted that $P_{peak}$ is an unknown which has to be minimized. It may be the power consumption of any control step in the DFG, depending on the scheduled operations, and hence it is later used as a constraint. The constraints of the formulation, namely the uniqueness constraints, precedence constraints, resource constraints, frequency constraints and peak power constraints, remain the same as before.

5.5.1.2 Multiple Supply Voltages and Multicycling (MVMC)

In this subsection, we describe the ILP formulation for peak power minimization using multiple supply voltages and multicycling. In this scheme, the functional units are operated at multiple supply voltages, and the functional units operating at the lower voltage are scheduled in consecutive control steps. In this case the objective is to minimize the peak power consumption of the whole DFG over all control steps, described in Eqn. 5.23, without explicitly considering the average power minimization. Thus the ILP formulation becomes the one given below.

$$\text{Minimize}: \quad P_{peak} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_{clk}\right)_{c=1,2,\ldots,N} \qquad (5.25)$$

It should be noted that $P_{peak}$ is an unknown which has to be minimized. It may be the power consumption of any control step in the DFG, depending on the scheduled operations, and hence it is later used as a constraint. The constraints of the formulation, namely the uniqueness constraints, precedence constraints, resource constraints and peak power constraints, remain the same as before.
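To make the structure of these formulations concrete, the sketch below minimizes the peak power of a six-operation DFG by exhaustive enumeration rather than with an ILP solver. It is only an illustration: the mobility windows, precedence edges, power coefficients, resource limits and the excluded voltage/frequency combination are read off the example formulation that appears in Fig. 5.8, while the Python names (`OPS`, `EDGES`, `POWER`, `LIMIT`, `MODES`, `min_peak_power`) are assumptions introduced here.

```python
from itertools import product

# Operation -> (unit type, ASAP step, ALAP step); operations 1-3 use a
# multiplier and 4-6 an ALU, as suggested by the example formulation.
OPS = {1: ("mult", 1, 2), 2: ("mult", 1, 1), 3: ("mult", 1, 2),
       4: ("alu", 2, 2), 5: ("alu", 2, 3), 6: ("alu", 3, 3)}
EDGES = [(1, 6), (2, 4), (4, 6), (3, 5)]  # predecessor -> successor

# (unit type, voltage level, frequency level) -> power coefficient
POWER = {("mult", 1, 1): 39.6, ("mult", 1, 2): 19.8,
         ("mult", 2, 1): 17.3, ("mult", 2, 2): 8.6,
         ("alu", 1, 1): 1.0, ("alu", 1, 2): 0.5,
         ("alu", 2, 1): 0.5, ("alu", 2, 2): 0.2}

# (unit type, voltage level) -> units available in every control step (RC1)
LIMIT = {("mult", 1): 1, ("mult", 2): 2, ("alu", 1): 1, ("alu", 2): 1}

# Frequency constraint: voltage level 2 may not run at frequency level 1.
MODES = [(1, 1), (1, 2), (2, 2)]


def min_peak_power():
    """Return (best peak, assignment) over all feasible schedules."""
    ids = sorted(OPS)
    choices = [[(c, v, f)
                for c in range(OPS[o][1], OPS[o][2] + 1)
                for v, f in MODES] for o in ids]
    best_peak, best_plan = float("inf"), None
    for combo in product(*choices):
        plan = dict(zip(ids, combo))
        # Precedence: a successor starts at least one step after its predecessor.
        if any(plan[s][0] - plan[p][0] < 1 for p, s in EDGES):
            continue
        usage, load = {}, {}
        for o, (c, v, f) in plan.items():
            kind = OPS[o][0]
            usage[(c, kind, v)] = usage.get((c, kind, v), 0) + 1
            load[c] = load.get(c, 0.0) + POWER[(kind, v, f)]
        # Resource limits per control step, unit type and voltage level.
        if any(n > LIMIT[(kind, v)] for (c, kind, v), n in usage.items()):
            continue
        peak = max(load.values())  # this is the role PP plays in the ILP
        if peak < best_peak:
            best_peak, best_plan = peak, plan
    return best_peak, best_plan


peak, plan = min_peak_power()
```

Under these assumptions the enumeration settles on two low-voltage multiplications in the first control step, giving a peak of about 17.2, which is close to the 17.3 entry reported for this benchmark and constraint in Table 5.8. Enumeration is only practical for a DFG this small; the dissertation solves the same model with LPsolve.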
5.5.2 ILP-Based Scheduler

In this section, we discuss the solutions for the ILP formulations obtained in the previous section. The target architecture and characterised datapath components are from [55]. The ILP-based scheduler that minimizes the peak power consumption of the DFG has basically the same steps as the one presented in Fig. 5.1 for simultaneous peak and average power minimization. The first step is to determine the as soon as possible (ASAP) time stamp of each operation. The second step is the determination of the as late as possible (ALAP) time stamp of each vertex of the DFG. The ASAP time stamp is the start time and the ALAP time stamp is the finish time of each operation. These two times provide the mobility of an operation, and the operation must be scheduled within this mobility range. The mobility graph needs to be modified for the MVMC scheme. The scheduler then generates the ILP formulations based on the models described in Section 5.5.1. After the ILP formulation is solved (using LPsolve), the scheduled DFG is obtained. For the MVDFC scheme, the scheduler finally determines the cycle frequencies for the scheduled DFG.

5.5.2.1 Scheduling for MVDFC

We illustrate the solution of the ILP formulation in the MVDFC case with the help of the DFG shown in Fig. 5.7. The ASAP schedule is shown in Fig. 5.7(a) and the ALAP schedule is shown in Fig. 5.7(b). From the ASAP and ALAP schedules we obtain the mobility graph as in Fig. 5.7(c).

Figure 5.7. Example DFG (for RC1) (MVDFC): (a) ASAP Schedule, (b) ALAP Schedule, (c) Mobility Graph, (d) Final Schedule

Using this mobility graph, we have the ILP formulations shown in Fig.
5.8 for the resource constraint (RC1): two multipliers at 3.3 V, one multiplier at 5.0 V, one ALU at 3.3 V and one ALU operating at 5.0 V. We solved the formulation using LPsolve and, based on the results, obtained the scheduled DFG shown in Fig. 5.7(d). In Fig. 5.8, we use the following additional notations: PP: peak power, Mmult1: number of multipliers at voltage level 1, Mmult2: number of multipliers at voltage level 2, Malu1: number of ALUs at voltage level 1, and Malu2: number of ALUs at voltage level 2. The corresponding formulation expressed in AMPL [166] is given in Fig. 5.9.

/* ILP Formulation for Peak Power Minimization for MVDFC scheme */
/* Objective Function */
min: PP;

/* Uniqueness Constraints */
x1111 + x1112 + x1121 + x1122 + x1211 + x1212 + x1221 + x1222 = 1;
x2111 + x2112 + x2121 + x2122 = 1;
x3111 + x3112 + x3121 + x3122 + x3211 + x3212 + x3221 + x3222 = 1;
x4211 + x4212 + x4221 + x4222 = 1;
x5211 + x5212 + x5221 + x5222 + x5311 + x5312 + x5321 + x5322 = 1;
x6311 + x6312 + x6321 + x6322 = 1;

/* Precedence Constraints */
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - 2 x1211 - 2 x1212 - 2 x1221 - 2 x1222 - x1111 - x1112 - x1121 - x1122 >= 1;
2 x4211 + 2 x4212 + 2 x4221 + 2 x4222 - x2111 - x2112 - x2121 - x2122 >= 1;
3 x6311 + 3 x6312 + 3 x6321 + 3 x6322 - 2 x4211 - 2 x4212 - 2 x4221 - 2 x4222 >= 1;
3 x5311 + 3 x5312 + 3 x5321 + 3 x5322 + 2 x5211 + 2 x5212 + 2 x5221 + 2 x5222 - 2 x3211 - 2 x3212 - 2 x3221 - 2 x3222 - x3111 - x3112 - x3121 - x3122 >= 1;

/* Resource Constraints */
x1111 + x2111 + x3111 + x1112 + x2112 + x3112 <= 1; /* Mmult1 */
x1121 + x2121 + x3121 + x1122 + x2122 + x3122 <= 2; /* Mmult2 */
x1211 + x3211 + x1212 + x3212 <= 1; /* Mmult1 */
x1221 + x3221 + x1222 + x3222 <= 2; /* Mmult2 */
x4211 + x5211 + x4212 + x5212 <= 1; /* Malu1 */
x4221 + x5221 + x4222 + x5222 <= 1; /* Malu2 */
x5311 + x6311 + x5312 + x6312 <= 1; /* Malu1 */
x5321 + x6321 + x5322 + x6322 <= 1; /* Malu2 */

/* Frequency Constraints */
x1121 = 0; x1221 = 0; x2121 = 0; x3121 = 0; x3221 = 0; x4221 = 0; x5221 = 0; x5321 = 0; x6321 = 0;

/* Peak Power Constraints */
39.6 x1111 + 19.8 x1112 + 17.3 x1121 + 8.6 x1122 + 39.6 x2111 + 19.8 x2112 + 17.3 x2121 + 8.6 x2122 + 39.6 x3111 + 19.8 x3112 + 17.3 x3121 + 8.6 x3122 <= PP;
39.6 x1211 + 19.8 x1212 + 17.3 x1221 + 8.6 x1222 + 39.6 x3211 + 19.8 x3212 + 17.3 x3221 + 8.6 x3222 + 1.0 x4211 + 0.5 x4212 + 0.5 x4221 + 0.2 x4222 + 1.0 x5211 + 0.5 x5212 + 0.5 x5221 + 0.2 x5222 <= PP;
1.0 x5311 + 0.5 x5312 + 0.5 x5321 + 0.2 x5322 + 1.0 x6311 + 0.5 x6312 + 0.5 x6321 + 0.2 x6322 <= PP;

/* Integer Constraints */
INT x1111, x1112, x1121, x1122, x1211, x1212, x1221, x1222, x2111, x2112, x2121, x2122, x3111, x3112, x3121, x3122, x3211, x3212, x3221, x3222, x4211, x4212, x4221, x4222, x5211, x5212, x5221, x5222, x5311, x5312, x5321, x5322, x6311, x6312, x6321, x6322;

Figure 5.8. ILP Formulation for Example DFG (MVDFC)

/* ILP Formulation for Peak Power Minimization for MVDFC scheme */
param TASK;    # number of Tasks
param LEVEL;   # number of levels in DFG
param VOLT;    # number of voltage levels
param FREQ;    # number of frequency levels
param ASAP {1..TASK} >= 0, <= LEVEL;    # ASAP Schedule for each Task
param ALAP {1..TASK} >= 0, <= LEVEL;    # ALAP Schedule for each Task
param OP {1..TASK};                     # Type of Functional Unit
param POWER {1..2, 1..VOLT, 1..FREQ};   # Power Consumption of each Functional Unit
param M {1..2, 1..VOLT};                # Resource Constraints
var PP;
var X {i in 1..TASK, j in ASAP[i]..ALAP[i], v in 1..VOLT, f in 1..FREQ} binary;

# Objective Function
minimize peak_power: PP;

# Uniqueness Constraints
subject to uniq_cons {i in 1..TASK}:
    sum {j in ASAP[i]..ALAP[i], v in 1..VOLT, f in 1..FREQ} X[i, j, v, f] = 1;

# Precedence Constraints
subject to pred_cons1:
    sum {j in ASAP[6]..ALAP[6], v in 1..VOLT, f in 1..FREQ} j * X[6, j, v, f]
    - sum {j in ASAP[1]..ALAP[1], v in 1..VOLT, f in 1..FREQ} j * X[1, j, v, f] >= 1;
subject to pred_cons2:
    sum {j in ASAP[4]..ALAP[4], v in 1..VOLT, f in 1..FREQ} j * X[4, j, v, f]
    - sum {j in ASAP[2]..ALAP[2], v in 1..VOLT, f in 1..FREQ} j * X[2, j, v, f] >= 1;
subject to pred_cons3:
    sum {j in ASAP[6]..ALAP[6], v in 1..VOLT, f in 1..FREQ} j * X[6, j, v, f]
    - sum {j in ASAP[4]..ALAP[4], v in 1..VOLT, f in 1..FREQ} j * X[4, j, v, f] >= 1;
subject to pred_cons4:
    sum {j in ASAP[5]..ALAP[5], v in 1..VOLT, f in 1..FREQ} j * X[5, j, v, f]
    - sum {j in ASAP[3]..ALAP[3], v in 1..VOLT, f in 1..FREQ} j * X[3, j, v, f] >= 1;

# Resource Constraints
subject to res_cons_mult {j in 1..LEVEL, v in 1..VOLT}:
    sum {f in 1..FREQ, i in 1..TASK: ASAP[i] <= j <= ALAP[i] && OP[i] = 2} X[i, j, v, f] <= M[2, v];
subject to res_cons_alu {j in 1..LEVEL, v in 1..VOLT}:
    sum {f in 1..FREQ, i in 1..TASK: ASAP[i] <= j <= ALAP[i] && OP[i] = 1} X[i, j, v, f] <= M[1, v];

# Peak Power Constraints
subject to pp_cons {j in 1..LEVEL}:
    sum {v in 1..VOLT, f in 1..FREQ, i in 1..TASK: ASAP[i] <= j <= ALAP[i]} POWER[OP[i], v, f] * X[i, j, v, f] <= PP;

# Frequency Constraints
subject to freq_cons {i in 1..TASK, j in ASAP[i]..ALAP[i]}: X[i, j, 2, 1] = 0;

Figure 5.9. ILP Formulation for Example DFG (MVDFC) in AMPL

5.5.2.2 Scheduling for MVMC

We illustrate the solution of the ILP formulation in the MVMC case with the help of the DFG shown in Fig. 5.10. The ASAP schedule is shown in Fig. 5.10(a) and the ALAP schedule is shown in Fig. 5.10(b). From the ASAP and ALAP schedules we obtain the mobility graph shown in Fig. 5.10(c). This mobility graph is different from that shown in Fig. 5.7(c): in the MVMC case, the mobility graph takes the multicycle operations into account.

Figure 5.10. Example DFG (for RC1) (MVMC): (a) ASAP Schedule, (b) ALAP Schedule, (c) Mobility Graph, (d) Final Schedule

We assume two operating voltage levels, and when the multipliers are operated at the lower voltage, they take two clock cycles. For the characterised cells used in our experiment [55], the datapath runs at a single operating clock frequency, $f_{clk}$.
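The ASAP and ALAP time stamps that bound each operation's mobility range can be sketched as below. This is a minimal illustration that assumes unit-delay operations and an acyclic DFG; the function names and the edge list (taken from the precedence constraints of the example formulations) are assumptions of this sketch, not the dissertation's implementation.

```python
def asap_times(ops, edges):
    """Earliest control step of each operation (unit-delay operations assumed)."""
    preds = {o: [] for o in ops}
    for p, s in edges:
        preds[s].append(p)
    times = {}

    def visit(o):
        if o not in times:
            # One step after the latest predecessor; sources start at step 1.
            times[o] = 1 + max((visit(p) for p in preds[o]), default=0)
        return times[o]

    for o in ops:
        visit(o)
    return times


def alap_times(ops, edges, latency):
    """Latest control step of each operation for a given schedule length."""
    succs = {o: [] for o in ops}
    for p, s in edges:
        succs[p].append(s)
    times = {}

    def visit(o):
        if o not in times:
            # One step before the earliest successor; sinks end at `latency`.
            times[o] = min((visit(s) for s in succs[o]), default=latency + 1) - 1
        return times[o]

    for o in ops:
        visit(o)
    return times


ops = [1, 2, 3, 4, 5, 6]
edges = [(1, 6), (2, 4), (4, 6), (3, 5)]
asap = asap_times(ops, edges)
alap3 = alap_times(ops, edges, latency=3)
alap4 = alap_times(ops, edges, latency=4)
```

With a latency of three steps the windows match the step indices used in the MVDFC formulation of Fig. 5.8. Relaxing the latency to four steps reproduces the wider windows seen in the MVMC formulation of Fig. 5.11, which is one plausible reading of how the mobility graph is modified to make room for two-cycle multiplications.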
Using this mobility graph, we have the ILP formulations shown in Fig. 5.11 for the resource constraint (RC1): two multipliers at 3.3 V, one multiplier at 5.0 V, one ALU at 3.3 V and one ALU operating at 5.0 V. The corresponding formulation expressed in AMPL [166] is given in Fig. 5.12. We solved the formulation using LPsolve and, based on the results, obtained the scheduled DFG shown in Fig. 5.10(d). In Fig. 5.11, the notations PP, Mmult1, Mmult2, Malu1 and Malu2 have the same meaning as in the MVDFC case shown in Fig. 5.8.

/* ILP Formulation for Peak Power Minimization for MVMC scheme */
/* Objective Function */
min: PP;

/* Uniqueness Constraints */
x1212 + x1223 + x1111 + x1122 + x1133 = 1;
x2212 + x2111 + x2122 = 1;
x3111 + x3122 + x3133 + x3212 + x3223 = 1;
x4122 + x4133 + x4222 + x4233 = 1;
x5122 + x5133 + x5144 + x5222 + x5233 + x5244 = 1;
x6133 + x6144 + x6233 + x6244 = 1;

/* Peak Power Constraints */
39.6 x1111 + 8.6 x1212 + 39.6 x2111 + 8.6 x2212 + 39.6 x3111 + 8.6 x3212 <= PP;
39.6 x1122 + 8.6 x1212 + 8.6 x1223 + 39.6 x2122 + 8.6 x2212 + 39.6 x3122 + 8.6 x3212 + 8.6 x3223 + 1.0 x4122 + 0.5 x4222 + 1.0 x5122 + 0.5 x5222 <= PP;
39.6 x1133 + 8.6 x1223 + 39.6 x3133 + 8.6 x3223 + 1.0 x4133 + 0.5 x4233 + 1.0 x5133 + 0.5 x5233 + 1.0 x6133 + 0.5 x6233 <= PP;
1.0 x5144 + 0.5 x5244 + 1.0 x6144 + 0.5 x6244 <= PP;

/* Resource Constraints */
x1111 + x2111 + x3111 <= 1; /* Mmult1 */
x1212 + x2212 + x3212 <= 2; /* Mmult2 */
x1122 + x2122 + x3122 <= 1; /* Mmult1 */
x1212 + x1223 + x2212 + x3212 + x3223 <= 2; /* Mmult2 */
x1133 + x3133 <= 1; /* Mmult1 */
x1223 + x3223 <= 2; /* Mmult2 */
x4122 + x5122 <= 1; /* Malu1 */
x4222 + x5222 <= 1; /* Malu2 */
x4133 + x5133 + x6133 <= 1; /* Malu1 */
x4233 + x5233 + x6233 <= 1; /* Malu2 */
x5144 + x6144 <= 1; /* Malu1 */
x5244 + x6244 <= 1; /* Malu2 */

/* Precedence Constraints */
4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 3 x1133 - 3 x1223 - 2 x1122 - 2 x1212 - x1111 >= 1;
4 x6144 + 4 x6244 + 3 x6133 + 3 x6233 - 3 x4133 - 3 x4233 - 2 x4122 - 2 x4222 >= 1;
3 x4133 + 3 x4233 + 2 x4122 + 2 x4222 - 2 x2122 - 2 x2212 - x2111 >= 1;
4 x5144 + 4 x5244 + 3 x5133 + 3 x5233 + 2 x5122 + 2 x5222 - 3 x3133 - 3 x3223 - 2 x3122 - 2 x3212 - x3111 >= 1;

/* Integer Constraints */
INT x1111, x1122, x1133, x1212, x1223, x2111, x2122, x2212, x3111, x3122, x3133, x3212, x3223, x4122, x4133, x4222, x4233, x5122, x5133, x5144, x5222, x5233, x5244, x6133, x6144, x6233, x6244;

Figure 5.11. ILP Formulation for Example DFG (MVMC)

/* ILP Formulation for Peak Power Minimization for MVMC scheme */
param TASK;    # Number of Tasks
param LEVEL;   # Number of Levels in DFG
param VOLT;    # Number of Voltage Levels
param ASAP {1..TASK} >= 0;      # ASAP Schedule for each Task
param ALAP {1..TASK} >= 0;      # ALAP Schedule for each Task
param OP {1..TASK};             # Type of Functional Unit
param M {1..2, 1..VOLT};        # Resource Constraints
param POWER {1..2, 1..VOLT};    # Power consumption of the Functional Unit
var PP;
var X {i in 1..TASK, v in 1..VOLT, j in ASAP[i]..ALAP[i], k in ASAP[i]..ALAP[i]} binary;

# Objective Function
minimize peak_power: PP;

# Uniqueness Constraints
subject to uniq_cons {i in 1..TASK}:
    sum {j in ASAP[i]..ALAP[i]} X[i, 1, j, j]
    + (if OP[i] = 2 then sum {j in ASAP[i]..ALAP[i]-1} X[i, 2, j, j+1]
       else sum {j in ASAP[i]..ALAP[i]} X[i, 2, j, j]) = 1;

# Precedence Constraints
subject to pred_cons1:
    sum {v in 1..VOLT, j in ASAP[6]..ALAP[6]} j * X[6, v, j, j]
    - sum {j in ASAP[1]..ALAP[1]} j * X[1, 1, j, j]
    - sum {j in ASAP[1]..ALAP[1]-1} (j+1) * X[1, 2, j, j+1] >= 1;
subject to pred_cons2:
    sum {v in 1..VOLT, j in ASAP[6]..ALAP[6]} j * X[6, v, j, j]
    - sum {v in 1..VOLT, j in ASAP[4]..ALAP[4]} j * X[4, v, j, j] >= 1;
subject to pred_cons3:
    sum {v in 1..VOLT, j in ASAP[4]..ALAP[4]} j * X[4, v, j, j]
    - sum {j in ASAP[2]..ALAP[2]} j * X[2, 1, j, j]
    - sum {j in ASAP[2]..ALAP[2]-1} (j+1) * X[2, 2, j, j+1] >= 1;
subject to pred_cons4:
    sum {v in 1..VOLT, j in ASAP[5]..ALAP[5]} j * X[5, v, j, j]
    - sum {j in ASAP[3]..ALAP[3]} j * X[3, 1, j, j]
    - sum {j in ASAP[3]..ALAP[3]-1} (j+1) * X[3, 2, j, j+1] >= 1;

# Resource Constraints
subject to res_cons_mult {j in 1..LEVEL, v in 1..VOLT}:
    if v = 1 then
        sum {i in 1..TASK: ASAP[i] <= j <= ALAP[i] && OP[i] = 2} X[i, 1, j, j]
    else
        sum {i in 1..TASK: ASAP[i] < j < ALAP[i] && OP[i] = 2} (X[i, 2, j-1, j] + X[i, 2, j, j+1])
        + sum {i in 1..TASK: ALAP[i] = j && OP[i] = 2} X[i, 2, j-1, j]
        + sum {i in 1..TASK: ASAP[i] = j && OP[i] = 2} X[i, 2, j, j+1]
    <= M[2, v];
subject to res_cons_alu {j in 1..LEVEL, v in 1..VOLT}:
    sum {i in 1..TASK: ASAP[i] <= j <= ALAP[i] && OP[i] = 1} X[i, v, j, j] <= M[1, v];

# Peak Power Constraints
subject to pp_cons {j in 1..LEVEL-1}:
    sum {i in 1..TASK: ASAP[i] <= j <= ALAP[i]} X[i, 1, j, j] * POWER[OP[i], 1]
    + sum {i in 1..TASK: ASAP[i] < j < ALAP[i] && OP[i] = 2} (X[i, 2, j-1, j] * POWER[OP[i], 2] + X[i, 2, j, j+1] * POWER[OP[i], 2])
    + sum {i in 1..TASK: j = ALAP[i] && OP[i] = 2} X[i, 2, j-1, j] * POWER[OP[i], 2]
    + sum {i in 1..TASK: ASAP[i] = j && OP[i] = 2} X[i, 2, j, j+1] * POWER[OP[i], 2]
    + sum {i in 1..TASK: ASAP[i] <= j <= ALAP[i] && OP[i] = 1} X[i, 2, j, j] * POWER[OP[i], 2] <= PP;

Figure 5.12. ILP Formulation for Example DFG (MVMC) in AMPL

5.5.3 Experimental Results

The ILP-based MVDFC and MVMC schedulers were tested with five benchmark circuits: Example circuit (exp), FIR filter, IIR filter, HAL differential equation solver and Auto-Regressive filter (arf). The functional units used are ALUs and multipliers. The characterised datapath cells are taken from [55]. The scheduling algorithms were tested using the different sets of resource constraints (RC1, RC2, RC3, RC4, RC5) shown in Table 5.7. The experimental results for the various benchmark circuits are reported in Table 5.8 for both the MVDFC and the MVMC case. The power estimation includes the power consumption of the overheads, such as level converters (used in both MVDFC and MVMC schemes) and dynamic clocking units (needed for the MVDFC case). It is assumed that each resource has equal switching activity ($\alpha_{i,c}$). The results are reported for two supply voltages and for a fixed switching activity.

Table 5.7.
Resource Constraints used for our Experiment

Multipliers        ALUs               Resource Constraint
3.3 V    5.0 V     3.3 V    5.0 V     Label
2        1         1        1         RC1
3        0         1        1         RC2
2        0         0        2         RC3
1        1         0        1         RC4
2        0         0        1         RC5

To get a visual picture of the experimental results, we plotted the peak power reductions and the PDP reductions averaged over all resource constraints. Fig. 5.13 shows the average reductions for the different benchmarks, averaged over all resource constraints. The figure shows that the reductions are appreciable: over all benchmarks in Table 5.8, MVDFC reduces peak power by 75.3% and PDP by 59.7% on average, while MVMC reduces them by 36.0% and 19.7%. It is observed that the power consumption increases for higher switching activity and decreases for lower switching activity. The peak power reductions for the proposed scheduling schemes are listed along with those of other scheduling algorithms dealing with peak power reduction in Table 5.9. The table is not meant to provide an exact comparison, but to give a general idea of relative performance.

Table 5.8. Power Estimates for MVDFC and MVMC Scheduling Schemes

(S: single supply voltage and single frequency; Red.%: percentage reduction relative to S.)

                 Peak Power Estimate                    PDP Estimates
Benchmark  RC    S      MVDFC  Red.%  MVMC   Red.%      S      MVDFC  Red.%  MVMC   Red.%
exp        1     79.2   17.3   78.2   35.6   55.1       20.3   7.8    61.9   17.0   16.1
           2     79.2   17.3   78.2   51.8   34.6       20.3   7.8    61.9   12.0   41.1
           3     79.2   17.3   78.2   34.6   56.4       20.3   7.6    62.5   15.3   24.8
           4     40.7   9.2    77.5   40.7   0          27.1   10.5   61.4   27.1   0
           5     40.7   9.2    77.5   34.6   15.1       27.1   10.5   61.4   15.1   44.3
           Average values      77.9          32.2                     61.8          25.3
fir        1     79.2   17.3   78.2   40.7   48.6       56.2   21.8   61.1   51.8   7.8
           2     79.2   17.3   78.2   51.3   35.2       56.2   21.8   61.1   49.3   12.3
           3     79.2   17.3   78.2   35.6   55.1       56.2   22.0   60.9   34.3   39.0
           4     79.2   40.6   48.7   40.7   48.61      56.2   46.6   17.1   67.5   -20.1
           5     79.2   17.3   78.2   35.6   55.1       56.2   22.1   60.7   35.2   37.4
           Average values      72.3          48.5                     52.2          15.3
iir        1     118.9  37.1   68.8   74.2   37.6       45.0   17.8   60.5   43.3   3.8
           2     118.9  25.9   78.2   51.9   56.4       45.0   14.4   68.0   29.8   33.8
           3     79.3   17.3   78.2   34.6   56.4       56.2   19.4   65.5   40.2   28.5
           4     80.3   29.0   63.9   56.9   29.1       56.2   34.0   39.4   60.0   6.8
           5     80.3   17.8   77.9   34.6   56.9       56.2   18.8   66.5   40.2   28.5
           Average values      73.4          47.2                     60.0          20.3
hal        1     80.3   17.5   78.2   56.9   29.1       54.0   21.0   61.1   73.0   -35.2
           2     80.3   17.5   78.2   51.8   35.5       54.0   21.0   61.1   35.9   33.5
           3     80.3   17.8   77.8   35.6   55.7       54.0   20.8   61.5   42.3   21.7
           4     80.3   29.0   63.9   58.0   27.8       67.5   45.7   32.2   73.5   -8.9
           5     80.3   17.8   77.9   35.6   55.7       67.5   26.4   60.9   48.4   28.3
           Average values      75.2          40.8                     55.4          7.9
arf        1     40.7   8.9    78.2   35.0   14.0       114.7  31.5   72.5   66.2   42.3
           2     40.7   8.9    78.2   35.0   14.0       114.7  31.5   72.5   66.7   41.8
           3     40.7   9.1    77.5   35.6   12.5       114.7  38.2   66.7   68.3   40.5
           4     40.7   9.1    77.5   39.6   2.7        114.7  39.0   66.0   132.9  -15.9
           5     40.7   9.1    77.5   35.6   12.5       114.7  38.2   66.7   68.3   40.5
           Average values      77.8          11.1                     68.9          29.8
Overall Average                75.3          36.0                     59.7          19.7

Figure 5.13. Average Reductions for Benchmarks: (a) average peak power reduction (%) for MVDFC, (b) average PDP reduction (%) for MVDFC, (c) average peak power reduction (%) for MVMC, (d) average PDP reduction (%) for MVMC. [Bar charts; horizontal axis: benchmark circuits 1 to 5.]

Table 5.9. Power Reduction for Various Scheduling Schemes
(% estimated average reduction; "peak" is the peak power reduction and "PDP" the power delay product reduction.)

Benchmark   This work: MVDFC   This work: MVMC   Shiue [119]   Martin [44]   Raghunathan [47]
Circuits    peak    PDP        peak    PDP       peak          peak          peak
(1) exp     77.9    61.8       32.2    25.3      -             -             -
(2) fir     72.3    52.2       48.5    15.3      63.0          40.3          23.1
(3) iir     73.4    60.0       47.2    20.3      -             -             -
(4) hal     75.2    55.4       40.8    7.9       28.0          -             -
(5) arf     77.8    68.9       11.1    29.8      50.0          -             -

5.6 Conclusions

Reduction of both peak power and average power consumption of a CMOS circuit is important. This chapter addressed the reduction of peak power and average power at the behavioral level using low power datapath scheduling techniques. Two datapath scheduling schemes, one using multiple supply voltages and dynamic clocking and another using multiple supply voltages and multicycling, have been introduced.
ILP-based optimization techniques were used for the above two modes of datapath operation. A significant amount of peak and average power reduction over the single supply voltage and single frequency scenario could be achieved in both cases by the proposed scheduling algorithm. The reductions attained in peak power, average power and power delay product by using combined multiple supply voltages and dynamic frequency clocking were noteworthy. The results clearly indicate that dynamic frequency clocking is a better scheme than the multicycling approach for power minimization.

CHAPTER 6
ENERGY AND TRANSIENT POWER MINIMIZATION

In battery driven portable applications, the minimization of energy, average power, peak power and peak power differential is equally important to improve reliability and efficiency. The peak power and the peak power differential drive the transient characteristics of a CMOS circuit. In this chapter we propose a framework for simultaneous reduction of the energy and transient power during behavioral synthesis. A new parameter called Cycle Power Function (CPF) is defined, which captures the transient power characteristics as an equally weighted sum of the normalized mean cycle power and the normalized mean cycle differential power. Minimizing this parameter using multiple supply voltages and dynamic frequency clocking results in reduction of both energy and transient power [48]. The cycle differential power can be modeled either as the mean deviation from the average power or as the cycle-to-cycle power gradient. The switching activity information is obtained from behavioral simulations. Based on the above, we develop a new datapath scheduling algorithm called CPF-scheduler, which attempts power and energy minimization by minimizing the CPF parameter during the scheduling process.
The type and number of functional units available become the set of resource constraints for the scheduler. Experimental results indicate that a scheduler that minimizes the CPF, instead of conventional energy or average power, as its objective function can achieve significant reductions in power and energy.

The rest of the chapter is organized as follows. The derivation of the CPF based on the two models is presented in Section 6.1. The proposed scheduling algorithm is presented in Section 6.2. The subsequent sections present the experimental results and conclusions.

6.1 Cycle Power Function (CPF)

In this section, we introduce the notations and terminology required for defining the cycle power function (CPF). The CPF must be defined such that it can simultaneously capture the average power, the peak power and the peak power differential of the datapath. The peak power and the peak power differential determine the transient power characteristics of the circuit. Minimization of the CPF using multiple voltages results in minimization of energy as well. The datapath is represented as a sequencing data flow graph (DFG). The notations and terminology needed for the proposed models are given in Table 6.1.

Table 6.1.
List of Notations and Terminology used in CPF Modeling

$N$ : total number of control steps in the DFG
$O$ : total number of operations in the DFG
$c$ : a control step or a clock cycle in the DFG
$o_i$ : any operation $i$, where $1 \le i \le O$
$P_c$ : the total power consumption of all functional units active in control step $c$ (cycle power consumption)
$P_{peak}$ : peak power consumption for the DFG, equal to $\max(P_c)$
$P$ : mean power consumption of the DFG (average $P_c$ over all control steps)
$P_{norm}$ : normalised mean power consumption of the DFG
$DP_c$ : cycle difference power (for cycle $c$; a measure of cycle power fluctuation)
$DP_{peak}$ : peak differential power consumption for the DFG, equal to $\max(DP_c)$
$DP$ : mean of the cycle difference powers for all control steps in the DFG
$DP_{norm}$ : normalised mean of the cycle difference powers for all control steps in the DFG
$CPF$ : cycle power function
$FU_{k,v}$ : any functional unit of type $k$ operating at voltage level $v$
$FU_i$ : any functional unit $FU_{k,v}$ needed by $o_i$ for its execution ($o_i \in FU_{k,v}$)
$FU_{i,c}$ : any functional unit $FU_i$ active in control step $c$
$R_c$ : total number of functional units active in step $c$ (same as the number of operations scheduled in $c$)
$\alpha_{i,c}$ : switching activity of resource $FU_{i,c}$
$V_{i,c}$ : operating voltage of resource $FU_{i,c}$
$C_{i,c}$ : load capacitance of resource $FU_{i,c}$
$f_c$ : frequency of control step $c$

The CPF is defined to consist of two main components: the normalized mean cycle power and the normalized mean cycle difference power. The normalized mean cycle power ($P_{norm}$) is the mean cycle power ($P$) normalized with respect to the peak power consumption ($P_{peak}$) of the DFG. The normalized mean cycle difference power ($DP_{norm}$) is the mean cycle difference power ($DP$) normalized with respect to the peak power differential ($DP_{peak}$) of the DFG. The second component varies between the two models. The mean difference power is the mean of the cycle difference power $DP_c$ over the control steps.
In model 1, the cycle difference power $DP_c$ is defined as the absolute deviation of the cycle power from the mean cycle power. Then, the mean cycle difference power $DP$ is the mean deviation of the cycle power from the mean cycle power. On the other hand, in model 2, the cycle difference power $DP_c$ of a current cycle is modelled as the cycle-to-cycle power gradient. In other words, the cycle difference power $DP_c$ of a current control step $c$ is the difference (or gradient) of the current cycle power and the previous cycle power. This can be expressed mathematically as $DP_c = |P_c - P_{c-1}|$ or $DP_{c+1} = |P_{c+1} - P_c|$. In this case, the mean cycle difference power $DP$ is the mean difference (or the mean gradient). The two models are further elaborated and used in defining the CPF.

6.1.1 Model 1: CPF using Mean Deviation

For a set of $n$ observations, $x_1, x_2, \ldots, x_n$, from a given distribution, the sample mean (which is an unbiased estimator for the population mean, $\mu$) is $m = \frac{1}{n}\sum_{i=1}^{n} x_i$. The absolute deviation of these observations is defined as $\Delta x_i = |x_i - m|$. The mean deviation of the observations is given by $MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - m|$. In this case, we model the cycle difference power $DP_c$ as the absolute deviation of the cycle power $P_c$ from the mean cycle power $P$. Similarly, the mean difference power $DP$ is modelled as the mean deviation of the cycle power $P_c$. The mean cycle power $P$ is an unbiased estimate of the average power consumption of the DFG. The power consumption for any control step $c$ is given by Eqn. 6.1. This is the total power consumption of all functional units active in control step $c$.
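This also includes the power consumption of the level converters, where a level converter is considered as a resource operating in cycle $c$ if the current resource is driven by a resource operating at a lower voltage.

\[ P_c = \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \tag{6.1} \]

The peak power consumption of the DFG is the maximum power consumption over all the $N$ control steps, which can be expressed as below.

\[ P_{peak} = \max(P_c)_{c=1,2,\ldots,N} = \max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c\right)_{c=1,2,\ldots,N} \tag{6.2} \]

The mean cycle power ($P$), the normalised mean cycle power ($P_{norm}$), the cycle difference power ($DP_c$) and the peak differential power ($DP_{peak}$) follow from the definitions given earlier.

\[ P = \frac{1}{N}\sum_{c=1}^{N} P_c = \frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \tag{6.3} \]

\[ P_{norm} = \frac{P}{P_{peak}} \tag{6.4} \]

\[ DP_c = |P - P_c| \tag{6.5} \]

\[ DP_{peak} = \max(DP_c)_{c=1,2,\ldots,N} = \max(|P - P_c|)_{c=1,2,\ldots,N} \tag{6.6} \]

The mean cycle difference power ($DP$) is calculated as the sample mean of $DP_c$. This is a measure of the power spread or distribution of the cycle power over all control steps of the DFG.

\[ DP = \frac{1}{N}\sum_{c=1}^{N} DP_c = \frac{1}{N}\sum_{c=1}^{N} |P - P_c| = \frac{1}{N}\sum_{c=1}^{N} \left| \frac{1}{N}\sum_{c'=1}^{N}\sum_{i=1}^{R_{c'}} \alpha_{i,c'}\, C_{i,c'}\, V_{i,c'}^2\, f_{c'} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right| \tag{6.7} \]

The normalised mean cycle difference power ($DP_{norm}$) can be written as given below, with $P$ and $P_c$ expanded as in Eqn. 6.7.

\[ DP_{norm} = \frac{DP}{DP_{peak}} = \frac{\frac{1}{N}\sum_{c=1}^{N} |P - P_c|}{\max(|P - P_c|)_{c=1,2,\ldots,N}} \tag{6.8} \]

The above normalised mean cycle difference power $DP_{norm}$ is a unitless quantity in the range [0,1]. The cycle power function $CPF$, which is modelled as the equally weighted sum of the normalized mean cycle power ($P_{norm}$) and the normalized mean cycle difference power ($DP_{norm}$), is given below.

\[ CPF = P_{norm} + DP_{norm} \tag{6.9} \]

\[ CPF = \frac{P}{P_{peak}} + \frac{DP}{DP_{peak}} \tag{6.10} \]

\[ CPF = \frac{\frac{1}{N}\sum_{c=1}^{N} P_c}{\max(P_c)_{c=1,2,\ldots,N}} + \frac{\frac{1}{N}\sum_{c=1}^{N} |P - P_c|}{\max(|P - P_c|)_{c=1,2,\ldots,N}}, \quad P_c = \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \tag{6.11} \]

6.1.2 Model 2 : CPF using Cycle-to-Cycle Gradient

For a set $x_1, x_2, \ldots, x_n$ of $n$ observations from a given distribution, the observation-to-observation gradient can be defined as $\Delta x_i = |x_{i+1} - x_i|$, where $i = 1, 2, \ldots, n-1$. The mean gradient is given by $\frac{1}{n-1}\sum_{i=1}^{n-1} |x_{i+1} - x_i|$. It should be noted that there are $n-1$ gradients for $n$ observations. In this case, we model the cycle difference power $DP_c$ as the cycle-to-cycle power gradient and the mean difference power $DP$ as the mean gradient. The models for the mean cycle power or the average power (Eqn. 6.1 - 6.3) remain the same as before.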
The cycle difference power ($DP_c$) for any control step is defined as the difference in the power consumption of the current and the previous control step, as given below.

\[ DP_{c+1} = |P_{c+1} - P_c| = \left| \sum_{i=1}^{R_{c+1}} \alpha_{i,c+1}\, C_{i,c+1}\, V_{i,c+1}^2\, f_{c+1} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right| \tag{6.12} \]

The peak differential power ($DP_{peak}$) is characterized by:

\[ DP_{peak} = \max(|P_{c+1} - P_c|)_{c=1,2,\ldots,N-1} = \max\left(\left| \sum_{i=1}^{R_{c+1}} \alpha_{i,c+1}\, C_{i,c+1}\, V_{i,c+1}^2\, f_{c+1} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right|\right)_{c=1,2,\ldots,N-1} \tag{6.13} \]

The mean cycle difference power ($DP$) is calculated as:

\[ DP = \frac{1}{N-1}\sum_{c=1}^{N-1} DP_{c+1} = \frac{1}{N-1}\sum_{c=1}^{N-1} |P_{c+1} - P_c| = \frac{1}{N-1}\sum_{c=1}^{N-1} \left| \sum_{i=1}^{R_{c+1}} \alpha_{i,c+1}\, C_{i,c+1}\, V_{i,c+1}^2\, f_{c+1} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right| \tag{6.14} \]

The normalised mean cycle difference power ($DP_{norm}$) can be written as given below, with $P_c$ expanded as in Eqn. 6.1.

\[ DP_{norm} = \frac{DP}{DP_{peak}} = \frac{\frac{1}{N-1}\sum_{c=1}^{N-1} |P_{c+1} - P_c|}{\max(|P_{c+1} - P_c|)_{c=1,2,\ldots,N-1}} \tag{6.15} \]

Using Eqn. 6.4 and 6.15, the cycle power function ($CPF$) can be written as follows.

\[ CPF = P_{norm} + DP_{norm} = \frac{P}{P_{peak}} + \frac{DP}{DP_{peak}} = \frac{\frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c}{\max\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c\right)_{c=1,\ldots,N}} + \frac{\frac{1}{N-1}\sum_{c=1}^{N-1} \left| \sum_{i=1}^{R_{c+1}} \alpha_{i,c+1}\, C_{i,c+1}\, V_{i,c+1}^2\, f_{c+1} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right|}{\max\left(\left| \sum_{i=1}^{R_{c+1}} \alpha_{i,c+1}\, C_{i,c+1}\, V_{i,c+1}^2\, f_{c+1} - \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \right|\right)_{c=1,\ldots,N-1}} \tag{6.16} \]

The above function (Eqn. 6.11 or 6.16) can be used as the objective function for low power datapath scheduling. The minimization of this objective function using multiple supply voltages, dynamic frequency clocking and multicycling will lead to the reduction of energy and power parameters. From Eqns. 6.10, 6.11 and 6.16 we make the following observations about the cycle power function ($CPF$). The $CPF$ is a nonlinear function. It is a function of four parameters: average power ($P$), peak power ($P_{peak}$), average difference power ($DP$) and peak difference power ($DP_{peak}$). Each of these power parameters is dependent on switching activity, capacitance, operating voltage and operating frequency. The absolute value function ($|\cdot|$) in the numerator (of Eqn. 6.11 or 6.16) contributes to the nonlinearity. The denominator parameters, $P_{peak}$ and $DP_{peak}$, also contribute to the complex behavior of the function.
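The two CPF models defined above can be illustrated with a small numerical sketch. The Python below is only an illustration under stated assumptions: the function names (`cycle_power`, `cpf_model1`, `cpf_model2`) are hypothetical, the per-cycle powers are supplied as a plain list, and a normalised term is taken as zero when its peak is zero (a flat power profile), a corner case the text does not address.

```python
def cycle_power(active_units, f_c):
    """Cycle power in the spirit of Eqn. 6.1.

    active_units: iterable of (alpha, C, V) tuples, one per functional unit
    active in the control step; f_c: frequency of that control step.
    """
    return f_c * sum(alpha * cap * volt ** 2 for alpha, cap, volt in active_units)


def _normalised(values):
    """Mean of the values divided by their peak; 0.0 if the peak is zero.

    The zero-peak convention is an assumption made for this sketch.
    """
    peak = max(values, default=0.0)
    return (sum(values) / len(values)) / peak if peak > 0 else 0.0


def cpf_model1(cycle_powers):
    """CPF with the difference power modelled as absolute deviation from the mean."""
    mean = sum(cycle_powers) / len(cycle_powers)
    deviations = [abs(mean - p) for p in cycle_powers]
    return _normalised(cycle_powers) + _normalised(deviations)


def cpf_model2(cycle_powers):
    """CPF with the difference power modelled as the cycle-to-cycle gradient."""
    gradients = [abs(b - a) for a, b in zip(cycle_powers, cycle_powers[1:])]
    return _normalised(cycle_powers) + _normalised(gradients)


if __name__ == "__main__":
    profile = [2.0, 4.0, 6.0, 4.0]  # hypothetical per-cycle powers
    print(cpf_model1(profile), cpf_model2(profile))
```

Both functions share the first term ($P_{norm}$) and differ only in how the difference power is formed, which mirrors the statement that only the second component varies between the two models.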
The power models expressed in Eqns. 6.16 and 6.11 for the $CPF$ use generic parameters, such as $\alpha_{i,c}$, $C_{i,c}$, $V_{i,c}$ and $f_c$. The intention of using such parameters is to make the $CPF$ model a general one, independent of any specific energy or power model. It can accommodate both lookup table based energy (power) models and energy (power) macromodels. The generic model can also help in easy integration of the $CPF$ model in a behavioral synthesis tool that uses both a behavioral power estimator and a datapath scheduler. Using the dynamic energy model proposed in [51], we can express the effective switching capacitance of our proposed model as,

\[ \alpha_i C_i = C_{sw_i}(\alpha_{i1}, \alpha_{i2}) \tag{6.17} \]

Here, $\alpha_i$ and $C_i$ are the parameters corresponding to the functional unit $FU_i$. The $C_{sw_i}$ is a measure of the effective switching capacitance of resource (functional unit) $FU_i$, which is a function of $\alpha_{i1}$ and $\alpha_{i2}$, where $\alpha_{i1}$ and $\alpha_{i2}$ are the average switching activity values on the first and second input operands of resource $FU_i$. It should be noted that the above switching model (Eqn. 6.17) handles input pattern dependencies. Moreover, the generic $CPF$ model can be easily tuned to handle any of the four modes of datapath circuit operation: (i) single supply voltage and single frequency, (ii) multiple supply voltages and single frequency, (iii) multiple supply voltages and dynamic frequency, and (iv) multiple supply voltages and multicycling. For example, for the single supply voltage and single frequency scheme, $V_{i,c}$ and $f_c$ are the same for all $c$; for multiple supply voltages and multicycling, $f_c$ is the same for all $c$. Using Eqn. 6.17 we rewrite Eqn. 6.11 as,

\[ CPF = \frac{\frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c}{\max\left(\sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c\right)_{c=1,\ldots,N}} + \frac{\frac{1}{N}\sum_{c=1}^{N}\left| \frac{1}{N}\sum_{c'=1}^{N}\sum_{i=1}^{R_{c'}} C_{sw_{i,c'}} V_{i,c'}^2 f_{c'} - \sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c \right|}{\max\left(\left| \frac{1}{N}\sum_{c'=1}^{N}\sum_{i=1}^{R_{c'}} C_{sw_{i,c'}} V_{i,c'}^2 f_{c'} - \sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c \right|\right)_{c=1,\ldots,N}} \tag{6.18} \]

Similarly, using Eqn. 6.17 we rewrite Eqn. 6.16 as,

\[ CPF = \frac{\frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c}{\max\left(\sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c\right)_{c=1,\ldots,N}} + \frac{\frac{1}{N-1}\sum_{c=1}^{N-1}\left| \sum_{i=1}^{R_{c+1}} C_{sw_{i,c+1}} V_{i,c+1}^2 f_{c+1} - \sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c \right|}{\max\left(\left| \sum_{i=1}^{R_{c+1}} C_{sw_{i,c+1}} V_{i,c+1}^2 f_{c+1} - \sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c \right|\right)_{c=1,\ldots,N-1}} \tag{6.19} \]

The notation $C_{sw_{i,c}}$ represents $C_{sw_i}$ for the functional unit $FU_i$ active in control step $c$. The above two functions (Eqn. 6.18 and Eqn. 6.19) are used as objective functions for our scheduling algorithm. The activities $\alpha_{i1}$ and $\alpha_{i2}$ are estimated using behavioral simulation of a DFG [167, 168, 169]. A lookup table is constructed to store the $C_{sw}$ values for different combinations of ($\alpha_1$ and $\alpha_2$) for different types of functional units, such as multipliers and ALUs. We use an interpolation technique to determine the $C_{sw}$ values for the ($\alpha_1$ and $\alpha_2$) combinations that are not available in the lookup table.

6.2 CPFScheduler Algorithm

In this section, we develop a scheduling algorithm that minimizes the objective functions (Eqn. 6.18 or 6.19) using multiple voltages and dynamic clocking to reduce energy and the power parameters. We assume the availability of different functional units operating at different supply voltages. In dynamic frequency clocking or frequency scaling, all the units are clocked by a single clock line which can switch frequencies at runtime [60, 62, 63]. In such systems, a dynamic clocking unit (DCU) generates different clocks using a clock dividing strategy. It should be noted that frequency scaling helps in reducing power but not energy. Moreover, the frequency reduction facilitates the operation of the different functional units at different voltages, which in turn helps in energy reduction. The target architecture model assumed for the scheduling is from [65]. Each functional unit is associated with a register and a multiplexor. The register and the multiplexor operate at the same voltage level as that of the functional unit. Level converters are used when a low-voltage functional unit is driving a high-voltage functional unit [65, 95].
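The lookup-table scheme for $C_{sw}$ described in Section 6.1 can be sketched as follows. The text says only that "an interpolation technique" is used; bilinear interpolation over a regular grid of activity values is an assumption made here for illustration, and the function name `interpolate_csw`, the shared `grid` axis and the table layout are all hypothetical.

```python
from bisect import bisect_right


def interpolate_csw(grid, table, a1, a2):
    """Bilinear interpolation of effective switching capacitance.

    grid:  sorted activity values used on both axes (assumed, at least 2 values)
    table: table[j][k] holds C_sw at (grid[j], grid[k])
    a1,a2: average switching activities on the two input operands
    Queries outside the grid are clamped to its edge (an assumption).
    """
    def bracket(x):
        # Clamp, then find the neighbouring grid points and the fractional position.
        x = min(max(x, grid[0]), grid[-1])
        hi = min(max(bisect_right(grid, x), 1), len(grid) - 1)
        lo = hi - 1
        return lo, hi, (x - grid[lo]) / (grid[hi] - grid[lo])

    j0, j1, t = bracket(a1)
    k0, k1, u = bracket(a2)
    return ((1 - t) * (1 - u) * table[j0][k0] + t * (1 - u) * table[j1][k0]
            + (1 - t) * u * table[j0][k1] + t * u * table[j1][k1])


if __name__ == "__main__":
    grid = [0.0, 0.5, 1.0]
    # Hypothetical characterization data, not values from the dissertation.
    table = [[g1 + 2.0 * g2 for g2 in grid] for g1 in grid]
    print(interpolate_csw(grid, table, 0.25, 0.75))
```

A denser grid reduces the interpolation error, which is consistent with the remark that a larger lookup table gives better accuracy.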
A controller decides which of the functional units are active in each control step, and those that are not active are disabled using the multiplexors. The controller has a storage unit to store the cycle frequency index ($cfi_c$) values obtained from the scheduling, which are used as the clock dividing factors for the dynamic clocking unit. The cycle frequency $f_c$ is generated dynamically and the corresponding functional unit is activated. The delay for a control step depends on the delays of the functional units ($d_{FU}$), multiplexor ($d_{Mux}$), register ($d_{Reg}$) and level converters ($d_{Conv}$), as expressed in the following equation.

\[ d_c = d_{FU} + d_{Mux} + d_{Reg} + d_{Conv} \tag{6.20} \]

where $d_c$ is the delay of control step $c$. In Eqn. 6.21, which gives the base frequency ($f_{base}$) and the cycle frequency index ($cfi_c$), $d_{min}$ is the minimum of the control step delays, and the number of allowable frequencies also appears as a parameter. The value is chosen in such a way that $cfi_c$ is the closest value greater than or equal to $d_c / d_{min}$.

Input : UDFG, resource constraints, number of allowable voltage levels, number of allowable frequencies, all voltage levels, $d_{FU}$, $d_{Mux}$, $d_{Reg}$, $d_{Conv}$
Output : scheduled DFG, $f_{base}$, $N$, $cfi_c$, power, energy and delay estimates
Step 1 : Calculate the switching activity at the inputs of each node through behavioral simulation of the DFG.
Step 2 : Construct a lookup table of effective switching capacitance, switching activity pairs.
Step 3 : Find ASAP and ALAP schedules of the UDFG.
Step 4 : Determine the number of multipliers and ALUs at different operating voltages.
Step 5 : Modify both ASAP and ALAP schedules obtained in Step 3 using the number of resources found in Step 4 as the initial resource constraint.
Step 6 : Calculate the total number of control steps, which is the maximum of the ASAP and ALAP schedules from Step 5.
Step 7 : Find the vertices having nonzero mobility and the vertices with zero mobility.
Step 8 : Use the CPFScheduler Heuristic to assign the time stamp and operating voltage for the vertices, and the cycle frequencies, such that $CPF$ and time penalty are minimum.
Step 9 : Find the base frequency $f_{base}$ and the cycle frequency index $cfi_c$.
Step 10 : Calculate power, energy and delay details.
Figure 6.1. The CPFScheduler Algorithm Flow

The inputs to the algorithm are an unscheduled data flow graph (UDFG), the resource constraints, the number of allowable voltage levels, the number of allowable frequencies, and the delay of each resource ($d_{FU}$), multiplexor ($d_{Mux}$) and register ($d_{Reg}$) at different voltage levels. The delays of the level converters ($d_{Conv}$) are represented in the form of a matrix that shows the delay for converting one voltage level $V_i$ to another voltage level $V_j$ (where both $V_i$ and $V_j$ belong to the set of allowable voltage levels). The resource constraint includes the number of ALUs and multipliers at the different allowable voltage levels $V_i$. The scheduling algorithm determines the proper time stamp for each operation, $f_{base}$, $cfi_c$ and the voltage level such that the objective function in Eqn. 6.18 or 6.19 as well as the time penalty is minimum. To reduce the time penalty, the less energy-consuming resources are operated at as high a frequency as possible.

The CPFScheduler : The flow of the proposed algorithm is outlined in Fig. 6.1. In step 1, the switching activities at the inputs of each node are determined using behavioral simulation of the DFG.
For this purpose, different sets of application specific input vectors (having different correlations) are given at the primary inputs of the DFG and the average switching activity at each node is calculated [167, 168, 169]. In step 2, the scheduler constructs a lookup table with effective switching capacitance and the average switching activity pair as described in Eqn. 6.17. The size of the lookup table impacts the accuracy of the results.

CPFScheduler Heuristic
{
(01) Initialize CurrentSchedule as modified ASAPSchedule ;
(02) while ( all mobile vertices are not time stamped ) do
(03) {
(04)   for the CurrentSchedule
(05)   {
(06)     if ( $v_i$ is a multiplication ) then Find the lowest available voltage for multipliers ;
(07)     if ( $v_i$ is add/sub/comparison ) then Find the highest available operating voltage for ALUs ;
(08)   } /* end for (04) */
(09)   Find $CPF$ for CurrentSchedule and denote it as Current$CPF$ ;
(10)   Find $R_T$ for CurrentSchedule and denote it as Current$R_T$ ;
(11)   Maximum = $-\infty$ ;
(12)   for each mobile vertex $v_i$
(13)   {
(14)     $c_1$ = CurrentSchedule[$v_i$] ; $c_2$ = ALAPSchedule[$v_i$] ;
(15)     for $c$ = $c_1$ to $c_2$ in steps of 1
(16)     {
(17)       Find a TempSchedule by adjusting CurrentSchedule in which $v_i$ is scheduled in step $c$ ;
(18)       Find next higher operating voltage for multiplication vertex for the TempSchedule (next lower for ALU operation) ;
(19)       Find $CPF$ for TempSchedule, denoted by Temp$CPF$ ;
(20)       Find $R_T$ for TempSchedule, denoted Temp$R_T$ ;
(21)       Difference = ( Current$CPF$ + Current$R_T$ ) - ( Temp$CPF$ + Temp$R_T$ ) ;
(22)       if ( Difference > Maximum ) then
(23)       {
(24)         Maximum = Difference ;
(25)         CurrentVertex = $v_i$ ;
(26)         CurrentCycle = $c$ ;
(27)         CurrentVoltage = Operating voltage of $v_i$
(28)       } /* end if (22) */
(29)     } /* end for (15) */
(30)   } /* end for (12) */
(31)   Adjust CurrentSchedule to accommodate CurrentVertex in CurrentCycle operating at the voltage assigned above ;
(32) } /* end while (02) */
} /* End CPFScheduler Heuristic */
Figure 6.2. The CPFScheduler Algorithm Heuristic
If the lookup table is large enough to contain the switching capacitance for all the average switching activities estimated in step 1, then the power model accuracy is the highest. The scheduler uses interpolation techniques to find the switching capacitance for a pair of input average switching activities that does not exist in the lookup table. The algorithm determines the as-soon-as-possible (ASAP) and the as-late-as-possible (ALAP) schedules for the UDFG in step 3. The ASAP schedule is unconstrained and the ALAP schedule uses the number of clock steps found in the ASAP schedule as the latency constraint. In step 4, the number of resources of each type and voltage level is determined. For example, if the resource constraint specifies multipliers and ALUs at each of the two voltage levels, then the relaxed-voltage initial resource constraint is the total number of multipliers and the total number of ALUs over both voltage levels. In step 5, the scheduler uses these relaxed-voltage resource constraints and modifies the ASAP and ALAP schedules to take the resource constraints into account. This restricts the mobility of vertices to a great extent and reduces the solution search space for the heuristic. Due to the resource constraints, the number of control steps of the modified ASAP and modified ALAP schedules may be different from that of the ASAP and ALAP schedules in step 3. In step 6, the scheduler fixes the total number of control steps of the schedule, which is the maximum of the control steps of the modified ASAP or modified ALAP schedule from step 5. In step 7, the vertices are marked as having zero mobility or nonzero mobility. The zero mobility vertices are those having the same modified ASAP time stamp and modified ALAP time stamp, and nonzero mobility vertices are those having different modified ASAP and modified ALAP time stamps.
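Steps 3 and 7 can be sketched on a toy graph as shown below. This is a simplified illustration, not the scheduler's implementation: it assumes unit-delay operations, computes only the unconstrained ASAP and ALAP schedules (the resource-constrained modification of step 5 is omitted), and the names `topo_order` and `asap_alap` and the example graph are hypothetical.

```python
from collections import deque


def topo_order(succ):
    """Topological order of a DAG given as {node: [successors]} (Kahn's algorithm)."""
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    queue = deque(v for v in succ if indeg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order


def asap_alap(succ):
    """Return (asap, alap, mobility) for unit-delay operations.

    Every node must appear as a key of succ. The ALAP schedule uses the
    ASAP latency as its latency constraint, as described in the text.
    """
    order = topo_order(succ)
    asap = {v: 1 for v in succ}
    for v in order:                      # push successors as early as precedence allows
        for w in succ[v]:
            asap[w] = max(asap[w], asap[v] + 1)
    latency = max(asap.values())
    alap = {v: latency for v in succ}
    for v in reversed(order):            # pull predecessors as late as precedence allows
        for w in succ[v]:
            alap[v] = min(alap[v], alap[w] - 1)
    mobility = {v: alap[v] - asap[v] for v in succ}
    return asap, alap, mobility


if __name__ == "__main__":
    dfg = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
    print(asap_alap(dfg))
```

Vertices whose mobility is zero have a fixed time stamp; only the remaining vertices need to be explored by the heuristic, which is why restricting mobility shrinks the search space.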
After determining the vertices having zero mobility and the vertices having nonzero mobility, the proper time stamp and operating voltage for mobile vertices, and the operating voltages for nonmobile vertices, are found. Further, the operating clock frequencies are established such that the $CPF$ as well as the time penalty is minimum. The CPFScheduler uses a heuristic algorithm for this purpose. In step 9, the scheduler determines the base frequency ($f_{base}$) and the cycle frequency index ($cfi_c$) using Eqn. 6.21. In step 10, the scheduler calculates the peak power, average power, peak power differential and energy estimates of the scheduled DFG, and also the critical path delay.

The CPFScheduler Heuristic : Fig. 6.2 shows the heuristic algorithm used by the CPFScheduler. The inputs to the CPFScheduler heuristic are the modified ASAP time stamp of each vertex ($S_i$), the modified ALAP time stamp of each vertex ($E_i$), the resource constraints, the number of allowable voltage levels and the number of allowable frequencies. The delay of each functional unit ($d_{FU}$), multiplexor ($d_{Mux}$) and register ($d_{Reg}$) at different voltage levels is also given as input. The delays of the level converters ($d_{Conv}$) are represented in the form of a matrix. The heuristic has to find the time stamp $c$ (in the range $[S_i, E_i]$) and operating voltage $V_i$ for each vertex $v_i$ with operation $o_i$. The aim of the heuristic is to minimize $CPF$ as described in Eqn. 6.18 and 6.19 while keeping the time penalty at a minimum. The heuristic minimizes the time ratio $R_T$ along with $CPF$ to minimize the time penalty. The time ratio ($R_T$) is defined as the ratio between the critical path delay when the vertices of the DFG are operating at multiple voltages ($T_D$) and when each of the vertices of the DFG is operated at the highest voltage ($T_S$). Expressed mathematically, $R_T = T_D / T_S$. These two objectives, minimization of $CPF$ (minimization of energy and power) and minimization of time penalty, are mutually conflicting.
This is due to the fact that reducing the operating voltage to minimize energy / power consumption increases the critical path delay and hence the time penalty. The heuristic operates the energy-hungry functional units (multipliers) at the lowest possible voltage (frequency) and the less energy-consuming functional units (ALUs) at the highest voltage (frequency) to achieve the simultaneous minimization of the mutually conflicting objectives. The heuristic fixes the operating voltages of the nonmobile vertices in this order, depending on the type of resource they need. The heuristic attempts to find a suitable time stamp and operating voltage for the mobile vertices using exhaustive search. The mobile vertices are placed in each of the time stamps within their mobile range ($[S_i, E_i]$), and for each placement and voltage assignment the $CPF$ and $R_T$ values are calculated. The predecessor and successor time stamps are adjusted accordingly to maintain the precedence. For this purpose the heuristic maintains a matrix with one row per control step and one column per combination of resource type and voltage level, whose entries are the number of available resources of that type and voltage level. For example, if only multipliers and ALUs are the available resources, then the number of resource types is two. If a voltage is assigned to a vertex, then the matrix entry of the corresponding type and operating voltage is decremented. A particular vertex is placed in the cycle for which the sum of $CPF$ and $R_T$ is minimum. The heuristic initially assumes the modified ASAP schedule (with relaxed-voltage resource constraints) as the current schedule (line 01). If a vertex is a multiplication operation, then the initial voltage assignment is the minimum available operating voltage, depending on the number of multipliers, whereas for an ALU operation vertex it is the maximum available operating voltage (lines 04-08).
Then the $CPF$ and $R_T$ values for the current schedule are calculated (line 09 and line 10). The heuristic finds the $CPF$ (and $R_T$) values for each allowable control step of each mobile vertex and for each available operating voltage, denoted as Temp$CPF$ (and Temp$R_T$) (lines 17-20). The statement in line 17 adjusts the current schedule by adjusting the time stamps of successor vertices while maintaining the resource constraint (using the matrix) and guaranteeing that the precedence is satisfied. In line 12, the vertices are visited in ASAP order. Another possible way of visiting the mobile vertices is to prioritise them in some manner, for example visiting the vertex with lower mobility first. The heuristic fixes the time step and operating voltage for a vertex, and hence the cycle frequency, for which $CPF + R_T$ is minimum (lines 22-26). For $CPF$ computation the heuristic uses $1/d_c$ as a temporary measure for $f_c$. The above steps are repeated until all mobile vertices are time stamped.

Time complexity of CPFScheduler Heuristic : Let there be $|V|$ vertices in the DFG, out of which $|V_m|$ vertices have mobility, and let the maximum mobility of any mobile vertex be $m_{max}$. It should be noted that the total number of vertices in the DFG is the total number of operations in the DFG plus the total number of NOOPs. The running time of finding an operating voltage from the matrix for a particular type of operation is constant. The statements in lines 04-08 have a running time of $O(|V|)$. The worst case running time of the statement in line 17 (or line 31) that adjusts the current schedule is linear in the number of vertices. The running time of the code segment in lines 17-28 is the sum of these contributions, which is $O(|V|)$, since it is always true that $|V_m| \le |V|$. So, the running time of the code segment in lines 15-29 is $O(m_{max} |V|)$. Thus, the running time of the code segment in lines 12-30 is $O(m_{max} |V_m| |V|)$. The other statements of the pseudocode have constant running time. So, the running time or time complexity of the code
So, the running time or time comple xity of the code 156 PAGE 174 se gment in line 0329 isR.<9 E 0TOkR.bxl9lE9 0OWV.<96l)0. This can be simplied to an weak upper bound on w orst case running of the code se gment (line 0331) under the assumption that96l 9Y, b ut in practice9l ' 9Y. Under the abo v e assumption we conclude that the w orst case upper bound on the running time of the code se gement in line 0331 isR.bl9Y C 0. Considering the while loop in line 02 the o v erall running time of the algorithm can be written asR.bxl9 C 9l)0. Again under the assumption that9ul)i9Y, we conclude that the w orst case upper bound on the running time of the algorithm isR.bhl9 0. In other w ords, the heuristic runs in time cubic to the number of vertices in the DFG It can be noted that the time comple xity of the algorithm is independent of the number of oper ating volta g e le vels. 6.3 Experimental Results The CPFScheduler algorithm w as implemented in C and tested with selected benchmark cir cuits. The benchmarks used are gi v en belo w .3AutoRe gressi v e lter (ARF) (total 28 nodes, 16*, 12+, 40 edges).3BandP ass lter (BPF) (total 29 nodes, 10*, 10+, 9, 40 edges).3DCT lter (total 42 nodes, 13*, 29+, 68 edges).3EllipticW a v e lter (EWF) (total 34 nodes, 8*, 26+, 53 edges).3FIR lter (total 23 nodes, 8*, 15+, 32 edges).3HAL dif ferential equation solv er (total 11 nodes, 6*, 2+, 2, 1, 16 edges). Our algorithm can handle lar ge DFGs and nd solutions in reasonable time. The parameters used to e xpress our e xperimental results are sho wn in T able 6.2. W e use a lookup table method as discussed in Section 6.1 for a v erage switching capacitance calculation. The lookup table construction consists of tw o phases, such as input pattern generation and cell characterization. W e generate the primary input signals of dif ferent correlations, using the autore gressi v e mo ving a v erage (ARMA) model [169 ]. W e perform the characterization of the 157 PAGE 175 T able 6.2. 
Notations used to Express the Results ; : total ener gy consumption assuming single frequenc y and single supply v oltage ; : total ener gy consumption for dynamic clocking and multiple supply v oltage % r : peak po wer consumption for single frequenc y and single supply v oltage % r : peak po wer consumption for dynamic clocking and multiple supply v oltage %l : minimum po wer consumption for single frequenc y and single supply v oltage %l : minimum po wer consumption for dynamic clocking and multiple supply v oltage : e x ecution time assuming single frequenc y : e x ecution time assuming dynamic frequenc y sV;: total ener gy reduction t v t sV%: a v erage po wer reduction wt .z¡ t wv .z¡ v t .z¡ t sV% r: peak po wer reduction n t n v n t sVY%: dif ferential po wer reduction n t n t n v n v n t n t ¡: time ratio ¡ v ¡ t physical implementations of the library modules a v ailable in [55 ] by applying the input patterns generated as abo v e for the v alues of ( m @ :h m C) pairs in the table. W e used interpolation to nd the a v erage switching capacitance for an y of (m @ :hgm C) pairs that do not e xist in the lookup table. It should be noted that lar ger the size of lookup table, better is the accurac y Our lookup table has 100 pairs of entries for (im @ :hm C). The signals are propagated through dif ferent operators in the DFG and the a v erage switching acti vities are calculated as described in [169 ] for each node. Our rst set of e xperiments were carried out for the$%'&model 1 (Eqn. 6.18) in which the c ycle dif ference po wer is based on the absolute de viation. W e tested the scheduling algorithm using the follo wing sets of resource constraints (RC1, RC2, RC3, RC4). 
RC1 and RC2 : multipliers at the lower voltage level and ALUs at the higher voltage level (the two sets differ in the number of units)
RC3 : multipliers at the lower voltage level; ALUs at both the lower and the higher voltage level
RC4 : multipliers at both the lower and the higher voltage level; ALUs at both the lower and the higher voltage level
The sets of resource constraints were chosen so as to cover resources at different operating voltages. The number of allowable voltage levels was assumed to be two and the maximum number of allowable frequencies to be three. The CPFScheduler determines the three frequencies used in this case. The experimental results for different benchmarks are shown in Table 6.3 for different resource constraints.

Table 6.3. Power Estimates for Different Benchmarks (using Model 1)
Power reduction details, energy savings, number of clock cycles and time penalty
CKT  RC  P_pS   P_pD   dP_p   P_mS   P_mD   dDP    dP     dE     N   R_T
         (mW)   (mW)   (%)    (mW)   (mW)   (%)    (%)    (%)
ARF  1    9.30  2.83   69.60  0.26   0.52   74.50  71.40  47.57  18  1.6
ARF  2   18.33  4.77   73.96  0.26   0.52   76.47  68.30  47.57  13  1.4
ARF  3   18.59  4.84   73.96  0.26   0.52   76.44  71.72  49.87  11  1.5
ARF  4   18.59  7.26   60.96  0.26   0.52   63.25  59.10  29.49  11  1.5
Average values         69.62                72.67  67.63  43.62      1.5
BPF  1    9.30  2.45   73.62  0.26   0.52   78.64  65.80  46.69  17  1.3
BPF  2   18.33  4.20   77.10  0.26   1.67   86.03  58.81  46.69  17  1.2
BPF  3   18.59  4.84   73.96  0.52   0.97   78.59  71.09  48.61   9  1.4
BPF  4   18.59  7.33   60.60  0.52   0.97   64.84  64.01  32.02   9  1.4
Average values         71.32                77.02  64.93  43.50      1.3
DCT  1    9.30  2.83   69.60  0.26   0.52   74.50  50.90  42.44  29  1.1
DCT  2    9.30  2.83   69.60  0.26   0.52   74.50  50.90  42.44  29  1.1
DCT  3   18.59  4.84   73.96  0.26   0.40   75.75  67.70  42.93  15  1.4
DCT  4   18.59  7.61   59.05  0.26   0.40   60.63  65.19  38.49  15  1.4
Average values         68.05                71.35  58.67  43.58      1.2
EWF  1    9.30  2.45   73.62  0.26   0.52   78.64  41.17  44.43  27  0.9
EWF  2   18.07  4.07   77.49  0.26   0.52   80.09  37.49  44.43  27  0.9
EWF  3   18.07  4.07   77.49  0.26   0.40   79.38  57.89  44.73  16  1.2
EWF  4   18.07  6.55   63.75  0.26   0.40   65.49  53.10  38.45  16  1.2
Average values         73.09                75.90  47.41  43.01      1.1
FIR  1    9.30  2.74   70.52  0.26   0.52   75.45  58.54  46.11  15  1.3
FIR  2    9.30  2.74   70.52  0.26   0.52   75.45  58.54  46.11  15  1.3
FIR  3   18.59  4.77   74.32  0.26   0.40   76.12  51.21  46.77  11  1.0
FIR  4   18.59  7.04   62.15  0.24   0.40   63.77  40.69  27.21  11  1.2
Average values         69.38                72.70  52.25  41.55      1.2
HAL  1    9.30  2.45   73.62  0.26   1.67   91.38  72.32  50.58   7  1.6
HAL  2   18.33  4.49   75.53  0.26   1.67   84.44  64.70  50.58   5  1.4
HAL  3   18.33  4.49   75.53  0.52   0.97   80.27  72.48  51.84   4  1.5
HAL  4   18.33  6.97   61.98  0.52   0.97   66.32  57.14  25.00   4  1.5
Average values         71.67                80.60  66.66  44.50      1.5
Overall average values 70.52                75.04  59.59  43.29      1.3

The average results are shown in Fig. 6.7 for visual inspection. The results take into account the power or energy consumption of overheads, such as the level converters and the dynamic clocking unit. The results indicate that the scheduling scheme could achieve significant reductions in peak power, peak power differential, average power and total energy with reasonable time penalties. The time penalty for the benchmark circuits ARF and HAL was relatively high. For many cases, CPFScheduler could reduce energy and power even without any time penalty, or even with a gain in time. This happens when the performance degradation due to multiplications in the critical path is adequately compensated by the number of ALU operations in the critical path. For this to happen, the number of ALU operations should be larger than or equal to the number of multiplications in the critical path. This is the case for most of the schedules obtained for the EWF and FIR benchmarks, as indicated by a time ratio ($R_T$) of less than or equal to one. For the above experimental setup, we plotted the power consumption per cycle, over all the control steps (clock steps), for different benchmarks in Fig.
6.3,6.4, 6.5 and 6.6 for resource constraints RC1, RC2, RC3 and RC4, respecti v ely The curv es labeled as S correspond to the prole when the schedule is operated at a single frequenc y (which is the maximum frequenc y of the slo west operator the multiplier) and single v oltage. The proles labeled as D correspond to the case when dynamic clocking and multiple v oltage scheme are used. The ef fecti v eness of the proposed scheduling scheme is ob vious from the gures. Since the$%'&is a comple x function consisting of se v eral parameters, it is dif cult to accurately quantify the impact of a specic parameter W e also performed e xperiments with three v oltage le v els (9:hA9: Z Z 9) and four frequenc y le v els. The results could impro v e within the range ofd[4"in terms of po wer or ener gy reductions. Ho we v er the time penalty increased by7. It is to be noted that the number of allo w able frequenc y le v els should be as close to the number of allo w able v oltages in order to k eep the time penalty within a reasonable limit. W e performed the same set of e xperiments for the CPF model 2 (Eqn. 6.19) in which the c ycle dif ference po wer is modeled as c ycletoc ycle po wer gradient. The e xperimental results for dif ferent benchmarks are sho wn in T able 6.4 for dif ferent resource constraints using model 2 and the a v erage data presented in Fig. 6.8. The results tak e into account 160 PAGE 178 0 5 10 15 20 0 5 10 (1) ARF S D control steps (c) >cycle power (Pc) > 0 5 10 15 20 0 5 10 (2) BPF S D control steps (c) >cycle power (Pc) > 0 10 20 30 0 5 10 (3) DCT S D control steps (c) >cycle power (Pc) > 0 10 20 30 0 5 10 (4) EWF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 (5) FIR S D control steps (c) >cycle power (Pc) > 0 2 4 6 8 0 5 10 (6) HAL S D control steps (c) >cycle power (Pc) > Figure 6.3. 
Cycle Po wer Consumptions for Resource Constraint RC1 0 5 10 15 0 5 10 15 20 (1) ARF S D control steps (c) >cycle power (Pc) > 0 5 10 15 20 0 5 10 15 20 (2) BPF S D control steps (c) >cycle power (Pc) > 0 10 20 30 0 5 10 (3) DCT S D control steps (c) >cycle power (Pc) > 0 10 20 30 0 5 10 15 20 (4) EWF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 (5) FIR S D control steps (c) >cycle power (Pc) > 1 2 3 4 5 0 5 10 15 20 (6) HAL S D control steps (c) >cycle power (Pc) > Figure 6.4. Cycle Po wer Consumptions for Resource Constraint RC2 161 PAGE 179 0 5 10 15 0 5 10 15 20 (1) ARF S D control steps (c) >cycle power (Pc) > 0 2 4 6 8 10 0 5 10 15 20 (2) BPF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 15 20 (3) DCT S D control steps (c) >cycle power (Pc) > 0 5 10 15 20 0 5 10 15 20 (4) EWF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 15 20 (5) FIR S D control steps (c) >cycle power (Pc) > 1 2 3 4 0 5 10 15 20 (6) HAL S D control steps (c) >cycle power (Pc) > Figure 6.5. Cycle Po wer Consumptions for Resource Constraint RC3 0 5 10 15 0 5 10 15 20 (1) ARF S D control steps (c) >cycle power (Pc) > 0 2 4 6 8 10 0 5 10 15 20 (2) BPF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 15 20 (3) DCT S D control steps (c) >cycle power (Pc) > 0 5 10 15 20 0 5 10 15 20 (4) EWF S D control steps (c) >cycle power (Pc) > 0 5 10 15 0 5 10 15 20 (5) FIR S D control steps (c) >cycle power (Pc) > 1 2 3 4 0 5 10 15 20 (6) HAL S D control steps (c) >cycle power (Pc) > Figure 6.6. Cycle Po wer Consumptions for Resource Constraint RC4 162 PAGE 180 T able 6.4. 
Po wer Estimates for Dif ferent Benchmarks (using Model 2) C Po wer reduction details, Ener gy sa vings, Number of clock c ycles and T ime penalty K R %ur %ur s%vr % l % l s % sV% s; p ¡ T C .b 0 .b 0 (%) .b 0 .b 0 (%) (%) (%) 1 2 3 4 5 6 7 8 9 10 11 12 1 9.30 2.64 71.58 0.26 0.52 76.54 71.99 48.64 18 1.6 A 2 18.33 4.68 74.49 0.26 0.52 77.01 68.91 48.64 13 1.4 R 3 18.59 4.74 74.49 0.26 0.52 76.47 71.35 49.87 11 1.5 F 4 18.59 7.23 61.13 0.26 0.52 63.42 56.77 24.34 11 1.5 A v erage v alues 70.42 73.36 67.25 42.87 1.5 1 9.30 2.40 74.15 0.26 0.52 79.18 66.48 47.74 17 1.3 B 2 18.33 4.44 75.80 0.26 0.52 78.34 56.67 47.74 17 1.2 P 3 18.59 4.74 74.99 0.52 1.35 81.23 73.26 49.48 9 1.4 F 4 18.59 7.23 61.13 0.52 0.87 64.84 64.38 32.72 9 1.4 A v erage v alues 71.52 78.78 65.20 44.42 1.3 1 9.30 2.64 71.58 0.26 0.52 76.54 52.25 44.02 29 1.1 D 2 9.30 2.64 71.58 0.26 0.52 76.54 52.25 44.02 29 1.1 C 3 18.59 4.74 74.49 0.26 0.40 76.29 68.68 44.66 15 1.4 T 4 18.59 7.47 59.85 0.26 0.40 61.44 66.21 40.31 15 1.4 A v erage v alues 69.38 72.70 59.85 43.25 1.2 1 9.30 2.40 74.15 0.26 0.52 79.18 42.22 45.43 27 0.9 E 2 18.07 4.07 77.49 0.26 0.52 80.09 34.42 41.70 27 0.9 W 3 18.07 4.07 77.49 0.26 0.40 79.38 55.29 41.32 16 1.2 F 4 18.07 6.55 63.75 0.26 0.40 65.49 50.50 35.03 16 1.2 A v erage v alues 73.22 76.03 45.60 40.87 1.1 1 9.30 3.01 67.62 0.26 0.52 72.46 56.30 43.27 15 1.3 F 2 9.30 3.91 57.99 0.26 0.52 62.55 56.36 43.27 15 1.3 I 4 18.59 5.04 72.87 0.26 0.40 74.64 48.61 48.61 11 1.0 R 5 18.59 7.53 59.51 0.24 0.40 61.09 24.70 17.86 11 1.2 A v erage v alues 64.50 69.69 46.49 38.25 1.2 1 9.30 2.40 74.15 0.26 1.48 89.75 72.62 51.11 7 1.6 H 2 18.33 4.44 75.80 0.26 1.48 83.62 65.08 51.11 5 1.4 A 4 18.33 4.44 75.80 0.52 0.87 79.99 72.68 52.20 4 1.5 L 5 18.33 6.92 62.65 0.52 0.87 66.04 57.34 25.35 4 1.5 A v erage v alues 72.10 79.85 66.93 44.94 1.5 A v erage v alues 70.19 75.07 58.55 42.43 1.3 163 PAGE 181 1 2 3 4 5 6 0 20 40 60 80 Different Benchmark Circuits >Peak Power Reduction (%) > 1 2 3 4 5 
[Figure 6.7 panels: Peak Power Reduction (%), Peak Pow Diff Reduction (%), Avg Power Reduction (%) and Energy Reduction (%), each plotted against the six benchmark circuits.]
Figure 6.7. Percentage Average Reduction for Benchmarks using Model 1

the power or energy consumptions due to the overheads. The results indicate that the energy and power reductions were similar with small differences, but there were no changes in terms of time penalty. We conclude that the minor difference is due to the fact that the quantitative difference between the values of $\frac{1}{N}\sum_{c=1}^{N}|P - P_c|$ and $\frac{1}{N-1}\sum_{c=1}^{N-1}|P_{c+1} - P_c|$ is not significant. We did not provide the cycle power plot for this model since it was almost the same as that of Model 1.

[Figure 6.8 panels: Peak Power Reduction (%), Peak Pow Diff Reduction (%), Avg Power Reduction (%) and Energy Reduction (%), each plotted against the six benchmark circuits.]
Figure 6.8. Percentage Average Reduction for Benchmarks using Model 2

6.4 Conclusions

For deep submicron and nanometer technology designs for low power battery driven systems, simultaneous minimization of total energy and transient power is beneficial. The CPF parameter defined and used in this work essentially facilitates such simultaneous optimization. The datapath scheduling algorithm described in this paper is particularly useful for synthesizing data intensive application specific integrated circuits. The algorithm attempts to optimize energy and power while keeping the time penalty at minimum. The CPF-Scheduler algorithm assumes the number of different types of resources at each voltage level and the number of allowable frequencies as resource constraints.
The work provides a unified framework for simultaneous multicost space metric optimization of different energy and power components in CMOS circuit design. Future work could address leakage reduction and interconnect issues. The effectiveness of the CPF in the context of a pipelined datapath and control intensive applications also needs to be investigated.

CHAPTER 7
TRANSIENT POWER MINIMIZATION

In the previous chapter, we proposed a framework for simultaneous reduction of the four parameters through datapath scheduling. A new parameter called cycle power function is defined that captures the four parameters, and it is minimized using a heuristic based scheduling algorithm. In this chapter, we modify the nonlinear $CPF$ (denoted as $CPF^*$) so that integer linear programming (ILP) can be used for its minimization during datapath scheduling. The model for $CPF$ takes into consideration the effect of switching activity on the power consumption of functional units. The first scheme, CPF-MVDFC, combines both multiple supply voltages (MV) and dynamic frequency clocking (DFC) for $CPF^*$ minimization [170], while the second scheme, CPF-MVMC, uses multiple supply voltages (MV) and multicycling (MC) [171]. We conducted experiments on selected high-level synthesis benchmark circuits for various resource constraints and estimated power, energy and energy delay product for each of them. Experimental results show that significant reductions in power, energy and energy delay product can be obtained.

The rest of the chapter is organized as follows. We discuss the related works in the next section. We define the cycle power profile function as the equally weighted sum of normalized mean cycle power and normalized mean cycle differential power, followed by the analysis of the functions ($CPF$ and $CPF^*$).
Since the $CPF^*$ function is nonlinear and we aim at using linear programming for its minimization, we discuss the procedures to handle standard nonlinearities using linear programming. The ILP formulations for $CPF^*$ minimization using multiple supply voltages and dynamic frequency clocking are discussed, followed by the ILP formulations for $CPF^*$ minimization using multiple supply voltages and multicycling. Then, we describe the ILP-based scheduling algorithm followed by the experimental results and conclusions.

7.1 Modified Cycle Power Function

In this section, we redefine the parameter called cycle power function ($CPF$), which captures the peak power, the peak power differential and the average power of the datapath circuit. It should be noted that $CPF$ captures the transient power characteristics of the circuit, and the minimization of $CPF$ using multiple voltages could lead to reduction of energy. In this section, we define $CPF$, study its nonlinear behavior and modify it so that we can use integer linear programming (ILP) for its minimization.

The datapath is represented as a sequencing data flow graph (DFG). The definitions and notations used in this chapter are the same as those of the previous chapter (Table 6.1). Following the same steps as in the previous chapter, the cycle power function $CPF$ is modeled as an equally weighted sum of the normalized mean cycle power ($P_{norm}$) and the normalized mean cycle difference power ($DP_{norm}$) as given below.

$$CPF(P_{norm}, DP_{norm}) = P_{norm} + DP_{norm} \qquad (7.1)$$

The $CPF$ has a value in the range [0,2]. In terms of peak cycle power ($P_{peak}$) and peak cycle difference power ($DP_{peak}$), $CPF$ can be expressed as:

$$CPF = \frac{P}{P_{peak}} + \frac{DP}{DP_{peak}} \qquad (7.2)$$

Thus, with the cycle power $P_c = \sum_{i=1}^{R_c} \alpha_{i,c} C_{i,c} V_{i,c}^2 f_c$, the cycle power function ($CPF$) can be written as follows.

$$CPF = \frac{\frac{1}{N}\sum_{c=1}^{N} P_c}{\max_{\forall c}(P_c)} + \frac{\frac{1}{N}\sum_{c=1}^{N} |P - P_c|}{\max_{\forall c}(|P - P_c|)} \qquad (7.3)$$

The above function (Eqn.
7.3) can serve as the objective function for low power datapath scheduling. The minimization of this objective function using multiple supply voltages, dynamic frequency clocking and multicycling can reduce both power and energy. From Eqns. 7.2 and 7.3, we make the following observations about the cycle power function ($CPF$).

The $CPF$ is a nonlinear function. It is a function of four parameters: average power ($P$), peak power ($P_{peak}$), average difference power ($DP$) and peak difference power ($DP_{peak}$). The absolute function ($|\cdot|$) in the numerator (of Eqn. 7.3) contributes to the nonlinearity. The complex behavior of the function is also contributed by the denominator parameters, $P_{peak}$ and $DP_{peak}$.

We need to develop scheduling algorithms that accept an unscheduled DFG, the resource/time constraints, switching activity information, load capacitance, voltage levels and the number of allowable frequency levels as input parameters. For optimum minimization of the function, such an algorithm has to be based on nonlinear optimization techniques, which are of large time and space complexity.

In this work, we aim at developing an integer linear programming (ILP) based model for minimization of the $CPF$. We alter the $CPF$ in order to simplify the ILP-based model. It is known that the denominator parameters are $P_{peak} = \max_{\forall c}(P_c)$ and $DP_{peak} = \max_{\forall c}(|P - P_c|)$. It is evident that $|P - P_c|$ is upper bounded by $P_{peak}$ for all control steps $c$, since $|P - P_c|$ is a measure of mean difference error of $P_c$ and both $P$ and $P_c$ lie between 0 and $P_{peak}$. Thus, we conclude that $DP_{peak}$ is upper bounded by $P_{peak}$. We modify the $CPF$ by substituting $DP_{peak}$ with $P_{peak}$ and define the modified $CPF$ (denoted as $CPF^*$) as follows.

$$CPF^* = \frac{P}{P_{peak}} + \frac{DP}{P_{peak}} = \frac{P + DP}{P_{peak}} = \frac{\frac{1}{N}\sum_{c=1}^{N} P_c + \frac{1}{N}\sum_{c=1}^{N} |P - P_c|}{\max_{\forall c}(P_c)} \qquad (7.4)$$

Unlike $CPF$, the $CPF^*$ is dependent on three factors, $P$, $P_{peak}$ and $DP$.
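As an illustration of the definitions above, the following minimal Python sketch evaluates $CPF$ (Eqn. 7.2) and $CPF^*$ (Eqn. 7.4) for a given per-cycle power profile. The function names, the sample profile and the treatment of a perfectly flat profile (where $DP_{peak}$ is zero) are assumptions made for this example and are not part of the original formulation.

```python
from statistics import mean


def cycle_power_metrics(cycle_powers):
    """Return (P, P_peak, DP, DP_peak) for a list of per-cycle power values."""
    p_mean = mean(cycle_powers)                        # P: mean cycle power
    p_peak = max(cycle_powers)                         # P_peak: peak cycle power
    diffs = [abs(p_mean - p_c) for p_c in cycle_powers]
    return p_mean, p_peak, mean(diffs), max(diffs)     # ..., DP, DP_peak


def cpf(cycle_powers):
    """CPF = P / P_peak + DP / DP_peak (Eqn. 7.2)."""
    p_mean, p_peak, dp, dp_peak = cycle_power_metrics(cycle_powers)
    # Assumption: a flat profile has no fluctuation, so the second term is 0.
    fluctuation = dp / dp_peak if dp_peak > 0 else 0.0
    return p_mean / p_peak + fluctuation


def cpf_star(cycle_powers):
    """CPF* = (P + DP) / P_peak (Eqn. 7.4)."""
    p_mean, p_peak, dp, _ = cycle_power_metrics(cycle_powers)
    return (p_mean + dp) / p_peak


if __name__ == "__main__":
    profile = [4.0, 2.0, 6.0, 4.0]   # hypothetical cycle powers
    print(cpf(profile), cpf_star(profile))
```

For the sample profile, $P = 4$, $P_{peak} = 6$, $DP = 1$ and $DP_{peak} = 2$, so $CPF = 4/6 + 1/2$ and $CPF^* = 5/6$. Because $DP_{peak}$ never exceeds $P_{peak}$, the value returned by `cpf_star` is never larger than the one returned by `cpf` for a positive profile.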
The absence of $DP_{peak}$ in the denominator helps in reducing the complexity of the ILP formulations (which will be discussed in the next section) to a great extent. We minimize the modified cycle power function ($CPF^*$) instead of $CPF$ using the ILP-based model.

The power models developed in Eqn. 7.3 for the $CPF$ use parameters such as $\alpha_{i,c}$, $C_{i,c}$, $V_{i,c}$ and $f_c$. The model can accommodate both the lookup table based energy (power) models and energy (power) macromodels. The generic model can also help in easy integration of a $CPF$ model in a behavioral synthesis tool that uses both a behavioral power estimator and a datapath scheduler. Using the dynamic energy model proposed in [123], the effective switching capacitance can be expressed as,

$$\alpha_i C_i = C_{sw_i}(\alpha_i^1, \alpha_i^2) \qquad (7.5)$$

Here, $\alpha_i$ and $C_i$ are the parameters corresponding to the functional unit $FU_i$ as defined before. $C_{sw_i}$ is a measure of the effective switching capacitance of the functional unit $FU_i$, which is a function of $\alpha_i^1$ and $\alpha_i^2$; $\alpha_i^1$ and $\alpha_i^2$ are the average switching activities on the first and second input operands of resource $FU_i$. It should be noted that in the above switching model (Eqn. 7.5) the input pattern dependencies can be handled. Moreover, the generic $CPF$ model can be easily tuned to handle any of the four modes of datapath circuit operation: (i) single supply voltage and single frequency, (ii) multiple supply voltages and single frequency, (iii) multiple supply voltages and dynamic frequency, and (iv) multiple supply voltages and multicycling. For the single supply voltage and single frequency scheme, $V_{i,c}$ and $f_c$ are the same for all $c$, while for multiple supply voltages and multicycling $f_c$ is the same for all $c$. Using Eqn. 7.5, we rewrite Eqn. 7.4 with $P_c = \sum_{i=1}^{R_c} C_{sw_{i,c}} V_{i,c}^2 f_c$ as,

$$CPF^* = \frac{\frac{1}{N}\sum_{c=1}^{N} P_c + \frac{1}{N}\sum_{c=1}^{N} |P - P_c|}{\max_{\forall c}(P_c)} \qquad (7.6)$$

The notation $C_{sw_{i,c}}$ represents $C_{sw_i}$ for the functional unit $FU_i$ active in control step $c$. We use the above equation (Eqn.
7.6) as the objective function for our scheduling algorithm. $\alpha_i^1$ and $\alpha_i^2$ are estimated using behavioral simulation of a DFG with a set of input vectors [167, 168, 169]. A lookup table is constructed that stores the $C_{sw}$ values for ($\alpha^1$ and $\alpha^2$) combinations for different types of functional units, such as multipliers and ALUs. We use interpolation to find the $C_{sw}$ values for the ($\alpha^1$ and $\alpha^2$) combinations that are not available in the lookup table.

7.2 Modeling of Nonlinearities

The modified cycle power function ($CPF^*$) discussed in the previous section is a nonlinear function. The nonlinearity is because of the absolute function ($|\cdot|$) and also because of the fractional form of the function itself. The ILP formulations need to handle these two forms of nonlinearity. We first address the transformations required to derive linear models of the nonlinear functions. Let us represent the general linear programming model as follows [172]:

Minimize: $\sum_j c_j x_j$
Subject to: $\sum_j a_{ij} x_j \le b_i, \ \forall i$; $x_j \ge 0, \ \forall j$ (7.7)

where $c_j$, $a_{ij}$, $b_i$ are known constants and $x_j$ are the decision variables.

7.2.1 LP Formulation Involving Sum of Absolute Deviations

The general form of this type of programming can be represented as given below [173, 174].

Minimize: $\sum_i |y_i|$
Subject to: $y_i + \sum_j a_{ij} x_j = b_i, \ \forall i$; $x_j \ge 0, \ \forall j$ (7.8)

where $y_i$ is the deviation between the prediction and observation. The $y_i$ [...]

Using these variables, we can rewrite the LP formulation in Eqn. 7.8 as follows.

Minimize: $\sum_i |y_i^1 - y_i^2|$
Subject to: $y_i^1 - y_i^2 + \sum_j a_{ij} x_j = b_i, \ \forall i$; $x_j \ge 0, \ \forall j$; $y_i^1, y_i^2 \ge 0, \ \forall i$ (7.10)

If the product of $y_i^1$ and $y_i^2$ is zero, then

$$|y_i^1 - y_i^2| = |y_i^1| + |y_i^2| = y_i^1 + y_i^2 \qquad (7.11)$$

Using the above, we can write the LP formulation expressed in Eqn. 7.10 as shown below.

Minimize: $\sum_i (y_i^1 + y_i^2)$
Subject to: $y_i^1 - y_i^2 + \sum_j a_{ij} x_j = b_i, \ \forall i$; $x_j \ge 0, \ \forall j$; $y_i^1, y_i^2 \ge 0, \ \forall i$ (7.12)

The formulations in Eqn. 7.8 and 7.12 are equivalent, and the minimization of Eqn. 7.12 will result in the minimization of Eqn. 7.8.
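The substitution behind Eqns. 7.10 to 7.12 can be checked numerically. The short Python sketch below splits a deviation into two nonnegative parts and confirms that their sum equals the absolute value whenever their product is zero. The function names and sample deviations are hypothetical, and the snippet only illustrates the identity; it is not an LP solver.

```python
def split_deviation(y):
    """Split y into nonnegative parts (y1, y2) such that y = y1 - y2."""
    y1 = max(y, 0.0)    # positive part
    y2 = max(-y, 0.0)   # negative part
    return y1, y2


def sum_abs_via_split(deviations):
    """Evaluate sum(|y_i|) through the linear objective sum(y1_i + y2_i)."""
    total = 0.0
    for y in deviations:
        y1, y2 = split_deviation(y)
        assert y1 * y2 == 0.0   # at most one part is nonzero
        assert y1 - y2 == y     # the pair reproduces the deviation
        total += y1 + y2        # linear replacement for |y|
    return total


if __name__ == "__main__":
    sample = [1.5, -2.0, 0.0]   # hypothetical deviations
    print(sum_abs_via_split(sample), sum(abs(y) for y in sample))
```

Any other feasible pair has the form $(y^1 + t, y^2 + t)$ with $t > 0$ and increases the linear objective by $2t$, which is why minimizing the sum drives the product of the two parts to zero.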
7.2.2 LP Formulation Involving Fraction

The general expression for the LP formulation involving fractions is considered below [174].

Minimize: $\dfrac{c_0 + \sum_j c_j x_j}{d_0 + \sum_j d_j x_j}$
Subject to: $\sum_j a_{ij} x_j \le b_i, \ \forall i$; $x_j \ge 0, \ \forall j$ (7.13)

where $c_j$ and $d_j$ are known constants and the denominator $d_0 + \sum_j d_j x_j$ is strictly positive. Let us assume new variables as follows:

$$y_0 = \left(d_0 + \sum_j d_j x_j\right)^{-1}, \qquad y_j = x_j y_0 \qquad (7.14)$$

Using the above transformation, the original formulation in Eqn. 7.13 can be modified to the following.

Minimize: $c_0 y_0 + \sum_j c_j y_j$
Subject to: $\sum_j a_{ij} y_j - b_i y_0 \le 0, \ \forall i$; $d_0 y_0 + \sum_j d_j y_j = 1$; $y_0, y_j \ge 0, \ \forall j$ (7.15)

The problems defined in Eqn. 7.13 and 7.15 are equivalent. On solving the problem in Eqn. 7.15, we substitute $x_j = y_j / y_0$ to get the results for $x_j$. Although the ILP formulations get complicated as the objective function described in Eqn. 7.4 consists of both of the above nonlinearities, it is much simpler than the ILP formulation of Eqn. 7.3. We observe that the cycle power fluctuation ($DP_c$) corresponds to $|y_i|$ in Eqn. 7.8. $DP_c$ is a measure of the absolute deviation of cycle power from average power, and $DP$ is a measure of mean deviation of the cycle power.

7.3 ILP Formulations to Minimize Cycle Power Function

In this section, we discuss the ILP models for minimization of the modified cycle power function ($CPF^*$). We describe the ILP models for two different scenarios of ASIC design. The first one targets designs with multiple supply voltages and dynamic frequency clocking (MVDFC). The other one targets multiple supply voltages and multicycling (MVMC) based designs. The ILP models formulated ensure that the dependency constraints and the resource constraints are satisfied. In order to formulate an ILP based model for Eqn. 7.6 and the scheduling schemes for the DFG, we use the following notations (Table 7.1).

Table 7.1.
List of Variables used in ILP Formulations

$M_{k,v}$ : maximum number of functional units of type $k$ operating at voltage level $v$ ($FU_{k,v}$)
$S_i$ : as soon as possible (ASAP) time stamp for the operation $o_i$
$E_i$ : as late as possible (ALAP) time stamp for the operation $o_i$
$P(C_{sw_i}, v, f)$ : power consumption of functional unit $FU_i$ at voltage level $v$ and operating frequency $f$ used by $o_i$ for its execution
$x_{i,c,v,f}$ : decision variable which takes the value of 1 if operation $o_i$ is scheduled in control step $c$ using the functional unit $FU_{k,v}$ and $c$ has frequency $f_c$
$y_{i,v,l,m}$ : decision variable which takes the value of 1 if operation $o_i$ is using the functional unit $FU_{k,v}$ and is scheduled in control steps $l$ to $m$
$L_{i,v}$ : latency for operation $o_i$ using a functional unit operating at voltage $v$ (in terms of number of clock cycles)

7.3.1 Multiple Voltages and Dynamic Frequency Clocking (MVDFC)

In this subsection, we describe the ILP formulation for minimization of $CPF^*$ using multiple supply voltages and dynamic frequency clocking. In dynamic frequency clocking [63, 59, 62], the clock frequency is varied on-the-fly based on the functional units active in that cycle. In this clocking scheme, all the units are clocked by a single clock line which switches at runtime. The frequency reduction creates an opportunity to operate the different functional units at different voltages, which in turn helps in further reduction of power.

Objective Function: The objective is to minimize the modified cycle power function described in Eqn. 7.4 of the whole DFG over all control steps.

Minimize: $CPF^*$ (7.16)

Using Eqn. 7.4, this can be restated as:

Minimize: $\dfrac{P + DP}{P_{peak}}$ (7.17)

This objective function has the two types of nonlinearities mentioned in the previous section. We first remove the nonlinearity introduced because of the fraction by putting the denominator as a constraint. Then, the problem in Eqn.
7.17 is transformed to the one given below.

Minimize: $\frac{1}{N}\sum_{c=1}^{N} P_c + \frac{1}{N}\sum_{c=1}^{N} |P - P_c|$
Subject to: Peak power constraints (7.18)

However, this transformed problem still has the nonlinearity in it because of the absolute function. This can be converted to an equivalent problem using the transformation suggested in the previous section.

Minimize: $\frac{1}{N}\sum_{c=1}^{N} P_c + \frac{1}{N}\sum_{c=1}^{N} (P + P_c)$
Subject to: Modified peak power constraints (7.19)

The peak power constraint in Eqn. 7.18 and the modified peak power constraint in Eqn. 7.19 will be discussed in the later part of the subsection. Since $\frac{1}{N}\sum_{c=1}^{N}(P + P_c) = 2P$, the problem expressed in Eqn. 7.19 is simplified to:

Minimize: $\frac{3}{N}\sum_{c=1}^{N} P_c$
Subject to: Modified peak power constraints (7.20)

Using the decision variables, the objective function is formulated as,

Minimize: $\sum_{c}\sum_{i}\sum_{v}\sum_{f} x_{i,c,v,f} \cdot \frac{3}{N} \cdot P(C_{sw_i}, v, f)$
Subject to: Modified peak power constraints (7.21)

Minimize: $\sum_{c}\sum_{i}\sum_{v}\sum_{f} x_{i,c,v,f} \cdot P^*(C_{sw_i}, v, f)$
Subject to: Modified peak power constraints (7.22)

where $P^*(C_{sw_i}, v, f)$ is given by $\frac{3}{N} P(C_{sw_i}, v, f)$.

Uniqueness Constraints: These constraints ensure that every operation $o_i$ is scheduled to one unique control step within the mobility range ($S_i$, $E_i$) with a particular supply voltage and operating frequency. We represent them as,

$$\forall i: \quad \sum_{c=S_i}^{E_i}\sum_{v}\sum_{f} x_{i,c,v,f} = 1 \qquad (7.23)$$

Precedence Constraints: These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in an earlier control step and its successors are scheduled in a later control step. These are modeled as [...]

Modified Peak Power Constraints: To eliminate the nonlinearity introduced due to the absolute function, we modify the above constraints, as outlined in Eqn. 7.18 and 7.19. The peak power constraints in Eqn. 7.26 are modified as,

$$\forall c,\ 1 \le c \le N: \quad \frac{1}{N}\sum_{c}\sum_{i}\sum_{v}\sum_{f} x_{i,c,v,f}\, P(C_{sw_i}, v, f) - \sum_{i}\sum_{v}\sum_{f} x_{i,c,v,f}\, P(C_{sw_i}, v, f) \le P^*_{peak} \qquad (7.27)$$

The $P^*_{peak}$ is a modified peak constraint which is added to the objective function and minimized along with it.
7.3.2 Multiple Voltages and Multicycling (MVMC)

In this subsection, we describe the ILP formulations based on the modified cycle power function ($CPF^*$) using multiple supply voltages and multicycling. In this scheme, the functional units are operated at multiple supply voltages. The functional units operating at lower voltages may need to be active in more than one consecutive control step to complete execution.

Objective Function: The objective is to minimize the $CPF^*$ for the entire DFG. Using Eqn. 7.4, this can be represented as:

Minimize: $CPF^* = \dfrac{P + DP}{P_{peak}}$ (7.28)

As discussed in the previous subsection, this objective function has two types of nonlinearities, which are because of the absolute function and the fractional form. The fractional nonlinearity is removed by introducing the denominator as a constraint. The corresponding constraints are known as peak power constraints. We remove the absolute function nonlinearity by modifying the peak power constraints, which gives rise to modified peak power constraints. Thus, the problem in Eqn. 7.28 is transformed to the following.

Minimize: $\frac{1}{N}\sum_{c=1}^{N} P_c + \frac{1}{N}\sum_{c=1}^{N} (P + P_c)$
Subject to: Modified peak power constraints (7.29)

The peak power constraint and the modified peak power constraint are discussed in the later part of the subsection. Since $\frac{1}{N}\sum_{c=1}^{N}(P + P_c) = 2P$, the problem in Eqn. 7.29 is simplified to:

Minimize: $\frac{3}{N}\sum_{c=1}^{N} P_c$
Subject to: Modified peak power constraints (7.30)

Using the decision variables, the above LP objective function is formulated as,

Minimize: $\sum_{l}\sum_{i}\sum_{v} y_{i,v,l,(l+L_{i,v}-1)} \cdot \frac{3}{N} \cdot P(C_{sw_i}, v, f_{clk})$
Subject to: Modified peak power constraints (7.31)

where $f_{clk}$ is the operating frequency level of the datapath circuit in multicycling mode.

Minimize: $\sum_{l}\sum_{i}\sum_{v} y_{i,v,l,(l+L_{i,v}-1)} \cdot P^*(C_{sw_i}, v, f_{clk})$
Subject to: Modified peak power constraints (7.32)

where $P^*(C_{sw_i}, v, f_{clk}) = \frac{3}{N} P(C_{sw_i}, v, f_{clk})$ are modified power values.
Uniqueness Constraints: These constraints ensure that every operation $o_i$ is scheduled in appropriate control steps within the mobility range ($S_i$, $E_i$) with a particular supply voltage. Depending on the supply voltage, it may be operated for more than one clock cycle. We represent them as,

$$\forall i: \quad \sum_{v}\sum_{l} y_{i,v,l,(l+L_{i,v}-1)} = 1 \qquad (7.33)$$

where $l$ ranges over the start steps allowed by the mobility range. When the operators are computed at the highest voltage, they are scheduled in one unique control step, whereas when they are operated at lower voltages they need more than one clock cycle for completion. Thus, for lower voltage, the mobility is restricted.

Precedence Constraints: These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in an earlier control step and its successors are scheduled in a later control step. These constraints should also take care of the multicycling operations. These are modeled as [...]

7.4 ILP-Based Scheduling Algorithm

In this section, we discuss the solutions for the ILP formulations obtained in the previous section and develop scheduling algorithms for both MVDFC and MVMC schemes. The target architecture model assumed for the scheduling schemes is from [65]. Each functional unit has a register and a multiplexor associated with it. The register and the multiplexor operate at the same voltage level as that of the functional unit. Level converters are used when a low-voltage functional unit drives a high-voltage functional unit [65, 95]. A controller decides which of the functional units are active in each control step, and those that are not active are disabled using the multiplexors. For the MVDFC scheme, the controller has a storage unit to store the cycle frequency index ($cfi_c$) values obtained from scheduling. This serves as the clock dividing factor for the dynamic clocking unit. The cycle frequency $f_c$ is generated dynamically and a functional unit operating at one of the supply voltages is activated.
The inputs to the algorithm are an unscheduled data flow graph (UDFG), the resource constraints, the number of allowable voltage levels, the number of allowable frequencies, and the delays of each resource ($d_{FU}$), the multiplexor ($d_{Mux}$) and the register ($d_{Reg}$) at different voltage levels. The delays of the level converters ($d_{Conv}$) are represented in the form of a matrix that shows the delay in converting one voltage level $V_i$ to another voltage level $V_j$ (where both $V_i$ and $V_j$ are among the allowable voltage levels). The resource constraint includes the number of ALUs and multipliers at the different voltage levels. The scheduling algorithm determines the $f_{base}$, $cfi_c$, time stamp for each operation, and voltage level such that the function $CPF^*$ (Eqn. 7.6) is minimum.

The ILP based scheduler which minimizes the modified cycle power function $CPF^*$ of the DFG is outlined in Fig. 7.1.

Input: UDFG, resource constraints, number of allowable voltage levels, number of allowable frequencies, all voltage levels, $d_{FU}$, $d_{Mux}$, $d_{Reg}$, $d_{Conv}$
Output: scheduled DFG, $f_{base}$, $N$, $cfi_c$, power, energy and delay estimates
Step 1: Construct a lookup table for effective switching capacitance.
Step 2: Calculate the switching activities at each node through behavioral simulation.
Step 3: Find ASAP schedule for the UDFG.
Step 4: Find ALAP schedule for the UDFG.
Step 5: Determine the mobility graph of each node.
Step 6: Modify the mobility graph for MVMC.
Step 7: Model the ILP formulations of the DFG using AMPL.
Step 8: Solve the ILP formulations using LPSolve.
Step 9: Find the scheduled DFG.
Step 10: Determine the cycle frequencies ($f_c$), $f_{base}$ and $cfi_c$ for MVDFC scheme.
Step 11: Estimate the power and energy consumptions of the scheduled DFG.
Figure 7.1. Scheduling for $CPF^*$ Minimization

In step 1, the scheduler constructs a lookup table for effective switching capacitance for known values of the average switching activity pair as described in Eqn. 7.5. In step 2, the scheduler determines the switching activities at the inputs of each node by using behavioral simulation of the DFG. For this purpose, different sets of application specific input vectors (having different correlations) are given at the primary inputs of the DFG, and the average switching activity at each input of the other nodes is calculated [167, 168, 169]. It should be noted that if the lookup table (in step 1) does not have the switching capacitance for an average switching activity value (in step 2), then the scheduler uses interpolation techniques to find the same.

The third step is to determine the as soon as possible (ASAP) time stamp of each operation. The fourth step is the determination of the as late as possible (ALAP) time stamp of each vertex of the DFG. The ASAP time stamp is the start time and the ALAP time stamp is the finish time of each operation. These two time stamps provide the mobility of an operation, and the operation must be scheduled within this mobility range. This mobility graph needs to be modified for the MVMC scheme.

The ILP formulations are constructed based on the models described in Section 7.3. The scheduler uses the modeling language AMPL to model the ILP formulations [166]. At this step, we calculate the power consumption of the functional units as follows. The operational delay of a functional unit is assumed as ($d_{FU} + d_{Mux} + d_{Reg} + d_{Conv}$). For the MVMC scheme, the operating frequency is the frequency corresponding to the operational delay at the highest operating voltage of the multiplier unit. On the other hand, for the MVDFC scheme, the operating frequency of a functional unit is calculated based on these operational delays using the formulas given in [48]. It is assumed to be the inverse of the operational delay of a functional unit at the corresponding supply voltage.
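The ASAP and ALAP time stamps of steps 3 to 5 can be obtained with two passes over the DFG. The Python sketch below is a minimal illustration under stated assumptions: every operation is assumed to take exactly one control step (so it does not cover the multicycle case of step 6), the DFG is given as a dictionary from each operation to its predecessors, and the operation names and the small example graph are hypothetical rather than the EXP benchmark.

```python
from graphlib import TopologicalSorter


def mobility_ranges(preds):
    """Return {op: (asap, alap)} for a DFG given as {op: [predecessor ops]}.

    Assumes every operation takes exactly one control step.
    """
    # Topological order: every operation appears after its predecessors.
    order = list(TopologicalSorter(preds).static_order())

    # ASAP: one step after the latest predecessor (step 1 if there is none).
    asap = {}
    for op in order:
        asap[op] = 1 + max((asap[p] for p in preds.get(op, [])), default=0)

    # Invert the edges to obtain the successors of each operation.
    succs = {op: [] for op in order}
    for op, op_preds in preds.items():
        for p in op_preds:
            succs[p].append(op)

    # ALAP: one step before the earliest successor (last step if there is none).
    last_step = max(asap.values())
    alap = {}
    for op in reversed(order):
        alap[op] = min((alap[s] for s in succs[op]), default=last_step + 1) - 1

    return {op: (asap[op], alap[op]) for op in order}


if __name__ == "__main__":
    # Hypothetical DFG: three multiplications feeding a chain of additions.
    dfg = {"m1": [], "m2": [], "m3": [],
           "a1": ["m1", "m2"], "a2": ["a1", "m3"], "a3": ["a2"]}
    for op, (early, late) in mobility_ranges(dfg).items():
        print(op, early, late)
```

In this example only `m3` has slack (it may be placed in control step 1 or 2); all other operations lie on the critical path and have equal ASAP and ALAP stamps.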
We get the switching capacitance from step 1 and step 2, and the power values are calculated whenever necessary for different operating voltages and frequencies. The scheduled DFG is obtained after the ILP formulation is solved using LPSolve. Then, the scheduler determines the $f_{base}$, $cfi_c$ and cycle frequency ($f_c$) using the methods proposed in [48] based on the delay of each cycle. Finally, the power consumption, energy consumption and the energy delay product of the scheduled DFG are calculated.

[Figure 7.2 panels: (a) ASAP Schedule for EXP DFG, (b) ALAP Schedule for EXP DFG.]
Figure 7.2. ASAP and ALAP Schedule for Example DFG (used to find Mobility Graph)

7.4.1 CPF-MVDFC Scheduling Scheme

We illustrate the solution for the ILP formulation in the MVDFC case with the help of the DFG shown in Fig. 7.2. The ASAP schedule is shown in Fig. 7.2(a) and the ALAP schedule is shown in Fig. 7.2(b). From the ASAP and ALAP schedules, we obtained the mobility graph shown in Fig. 7.3(a). We get the ILP formulations using this mobility graph. We solved the formulation using LPSolve and, based on the results, we obtained the scheduled DFG shown in Fig. 7.3(b) for the resource constraint (RC5), two multipliers at 2.4V and one ALU operating at 3.3V. Similarly, other schedules can be obtained for different resource constraints.

[Figure 7.3 panels: (a) Mobility Graph, (b) Final Schedule, with operations annotated with 3.3V and 2.4V supply voltages.]
Figure 7.3. Mobility Graph and Final Schedule for Example DFG for RC5 using MVDFC

7.4.2 CPF-MVMC Scheduling Scheme

We illustrate the solution for the ILP formulations of the MVMC case using the DFG shown in Fig. 7.2. The ASAP schedule is shown in Fig. 7.2(a) and the ALAP schedule is shown in Fig. 7.2(b). From the ASAP schedule (Fig. 7.2(a)) and the ALAP schedule (Fig.
7.2(b)), we obtained the mobility graph shown in Fig. 7.4(a). This mobility graph is different from that shown in Fig. 7.3(a). In the MVMC case, the mobility graph considers the multicycle operations. In this illustration, we assume that we have two operating voltage levels, and when the multipliers are operated at the lower voltage, they take two clock cycles. It should be noted that the mobility graph will depend on the number of operating voltages and the assumed operating frequency. We solved the ILP formulation using LPSolve and, based on the results, we obtained the scheduled DFG shown in Fig. 7.4(b) for the resource constraint (RC5), two multipliers at 2.4V and one ALU operating at 3.3V.

[Figure 7.4 panels: (a) Mobility Graph, (b) Final Schedule, with operations annotated with 2.4V and 3.3V supply voltages.]
Figure 7.4. Mobility Graph and Final Schedule for Example DFG for RC5 using MVMC

7.5 Experimental Results

The ILP based CPF-MVDFC and CPF-MVMC schedulers were tested with five benchmark circuits:
Example circuit (EXP) (8 nodes, 3*, 3+, 9 edges)
FIR filter (11 nodes, 5*, 4+, 19 edges)
HAL differential equation solver (13 nodes, 6*, 2+, 2-, 1<, 16 edges)
IIR filter (11 nodes, 5*, 4+, 19 edges)
Auto-Regressive filter (ARF) (15 nodes, 5*, 8+, 19 edges).

The following notations are used to express results (Table 7.2).

Table 7.2. List of Variables used to Express the Results

$P_{pS}$ : peak power consumption (in mW) for single supply voltage and single frequency scheme
$P_{pD}$ : peak power consumption (in mW) for multiple supply voltages and dynamic frequency operation
$P_{pM}$ : peak power consumption (in mW) for multiple supply voltages and multicycle operation
$P_{mS}$ : minimum power consumption (in mW) for any cycle assuming single frequency and single supply voltage
$P_{mD}$ : minimum power consumption (in mW) for any cycle for dynamic clocking and multiple supply voltage
$T_S$ : execution time for single frequency
$T_D$ : execution time for dynamic frequency
$T_M$ : execution time for multicycling operation
$E_S$ : total energy consumption (in nanoJoule or nJ) for single supply voltage and single frequency scheme
$E_D$ : total energy consumption (in nJ) for multiple supply voltages and dynamic frequency operation
$E_M$ : total energy consumption (in nJ) for multiple supply voltages and multicycle operation
$P_S$ : average power consumption (in mW) for single supply voltage and single frequency scheme, which is calculated as the mean of the cycle power consumptions
$P_D$ : average power consumption (in mW) for multiple supply voltages and dynamic frequency operation, estimated as the mean of the cycle power
$P_M$ : average power consumption (in mW) for multiple supply voltages and multicycle operation, calculated as the mean of the cycle power consumptions
$EDP_S$ : energy delay product (in $10^{-15}$ Joule-sec) for single supply voltage and single frequency operation ($E_S \times T_S$)
$EDP_D$ : energy delay product for multiple supply voltage and dynamic frequency clocking operation ($E_D \times T_D$)
$EDP_M$ : energy delay product for multiple supply voltage and multicycle operation ($E_M \times T_M$)
$\Delta P_p$ : percentage peak power reduction; for the MVDFC scheme this is defined as $\frac{P_{pS} - P_{pD}}{P_{pS}} \times 100$ and for the MVMC scheme it is calculated as $\frac{P_{pS} - P_{pM}}{P_{pS}} \times 100$
$\Delta DP$ : percentage differential power reduction, which is calculated as $\frac{(P_{pS} - P_{mS}) - (P_{pD} - P_{mD})}{P_{pS} - P_{mS}} \times 100$ for the MVDFC scheme and by the corresponding expression with the MVMC peak and minimum powers for the MVMC scheme
$\Delta P$ : percentage average power reduction; for the MVDFC scheme it is $\frac{P_S - P_D}{P_S} \times 100$ and for the MVMC scheme it is $\frac{P_S - P_M}{P_S} \times 100$
$\Delta E$ : percentage reduction in total energy, calculated as $\frac{E_S - E_D}{E_S} \times 100$ for the MVDFC scheme and as $\frac{E_S - E_M}{E_S} \times 100$ for the MVMC scheme
$\Delta EDP$ : percentage EDP reduction, calculated as $\frac{EDP_S - EDP_D}{EDP_S} \times 100$ for the MVDFC scheme and as $\frac{EDP_S - EDP_M}{EDP_S} \times 100$ for the MVMC scheme

We use the lookup table method presented in Section 7.1 for average switching capacitance calculation. The lookup table construction consists of two phases, input pattern generation and cell characterization. We generate the primary input signals of different correlations using the autoregressive moving average (ARMA) model [169]. We perform the characterization of the physical implementations of the library modules available in [55] by applying the input patterns generated above for some values of ($\alpha_i^1$, $\alpha_i^2$) pairs. Whenever necessary, we used interpolation to find the average switching capacitance for any other values of ($\alpha_i^1$, $\alpha_i^2$) pairs that do not exist in the lookup table. It should be noted that the larger the lookup table, the better the accuracy. The signals generated above are propagated through the different operators in the DFG and the average switching activities are calculated as described in [169].

Both scheduling algorithms, CPF-MVDFC and CPF-MVMC, were tested using five different sets of resource constraints (RC1, RC2, RC3, RC4, RC5): (1) multipliers (at 2.4V and at 3.3V) and ALUs (at 2.4V and at 3.3V), (2) multipliers (3 at 2.4V) and ALUs (at 2.4V and at 3.3V), (3) multipliers (at 2.4V) and ALUs (at 3.3V), (4) multipliers (at 2.4V and at 3.3V) and ALUs (at 3.3V), and (5) multipliers (2 at 2.4V) and ALUs (1 at 3.3V). These sets of resource constraints were chosen because they cover a representative mix of resource types at different operating voltages. The number of allowable voltage levels is two (2.4V, 3.3V) and the maximum number of allowable frequencies is three. The experimental results for various benchmark circuits are reported in Table 7.3 for the CPF-MVDFC scheduling scheme and in Table 7.4 for the CPF-MVMC scheduling scheme.
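The percentage figures defined in Table 7.2 are simple ratios, and a short Python sketch makes the arithmetic concrete. The helper names below are hypothetical; the input values are taken from the first MVDFC row for the EXP benchmark (RC1), assuming the column order peak power, minimum power and average power for the single-frequency and dynamic-frequency cases.

```python
def percent_reduction(base, new):
    """Percentage reduction of `new` relative to `base`."""
    return (base - new) / base * 100.0


def differential_power_reduction(p_peak_s, p_min_s, p_peak_d, p_min_d):
    """Percentage reduction of the peak-to-minimum power spread."""
    spread_s = p_peak_s - p_min_s   # spread for single voltage, single frequency
    spread_d = p_peak_d - p_min_d   # spread for the scheduled design
    return (spread_s - spread_d) / spread_s * 100.0


if __name__ == "__main__":
    # EXP benchmark, RC1, MVDFC (values in mW)
    print(round(percent_reduction(17.28, 4.56), 2))                          # 73.61
    print(round(differential_power_reduction(17.28, 0.46, 4.56, 0.35), 2))   # 74.97
    print(round(percent_reduction(8.87, 2.42), 2))                           # 72.72
```

The three printed values agree with the peak power, differential power and average power reductions reported for that row.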
The power and energy estimates include the power consumption of the overheads, such as level converters (data taken from [55]). The results are reported for two supply voltages. For CPFMVDFC scheduling, the frequencies found are ([...]). For the CPFMVMC scheduling scheme, the operating frequency ($f_{clk}$) is [...]. Fig. 7.5 and Fig. 7.6 give a visual picture of the experimental results. The figures show the reductions for the different benchmarks averaged over all resource constraints. The figures show that the reductions are significant, and that the reductions for the MVDFC scheme are better than those for the MVMC scheme. The CPFMVDFC scheme works effectively for all resource constraints and all benchmarks, whereas the CPFMVMC scheme does not produce good results for the ARF benchmark. We did not find any work in the literature that deals with simultaneous reduction of energy and transient power, so we could not provide a comparison with other works.

Table 7.3. Power, Energy and EDP Estimates for Benchmarks using MVDFC

| Benchmark | RC | $P_{pS}$ | $P_{pD}$ | $\Delta P_p$ (%) | $P_{mS}$ | $P_{mD}$ | $\Delta DP$ (%) | $P_S$ | $P_D$ | $\Delta P$ (%) | $E_S$ | $E_D$ | $\Delta E$ (%) | $EDP_S$ | $EDP_D$ | $\Delta EDP$ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) EXP | 1 | 17.28 | 4.56 | 73.61 | 0.46 | 0.35 | 74.97 | 8.87 | 2.42 | 72.72 | 2.96 | 1.57 | 46.8 | 0.99 | 0.87 | 11.34 |
| | 2 | 17.28 | 4.56 | 73.61 | 0.46 | 0.35 | 74.97 | 8.87 | 2.42 | 72.72 | 2.96 | 1.57 | 46.8 | 0.99 | 0.87 | 11.34 |
| | 3 | 17.28 | 4.56 | 73.61 | 0.46 | 0.9 | 78.24 | 8.87 | 2.61 | 70.57 | 2.96 | 1.6 | 46.0 | 0.99 | 0.8 | 18.98 |
| | 4 | 8.87 | 2.39 | 73.05 | 0.45 | 0.23 | 77.55 | 6.67 | 1.87 | 71.96 | 2.96 | 1.58 | 46.4 | 1.31 | 1.14 | 12.89 |
| | 5 | 17.28 | 4.56 | 73.61 | 0.23 | 0.45 | 75.89 | 6.65 | 1.96 | 70.53 | 2.96 | 1.6 | 45.9 | 1.31 | 0.87 | 32.49 |
| | Average | | | 73.50 | | | 76.32 | | | 71.70 | | | 46.38 | | | 17.41 |
| (2) FIR | 1 | 17.51 | 4.62 | 73.62 | 0.23 | 0.12 | 73.96 | 8.82 | 2.35 | 73.36 | 4.9 | 2.6 | 47.2 | 2.7 | 2.3 | 15.52 |
| | 2 | 25.92 | 6.84 | 73.61 | 0.23 | 0.12 | 73.84 | 8.82 | 2.36 | 73.24 | 4.9 | 2.6 | 47.2 | 2.7 | 2.0 | 26.09 |
| | 3 | 17.51 | 4.67 | 73.33 | 0.23 | 0.45 | 75.58 | 8.82 | 2.5 | 71.66 | 4.9 | 2.6 | 46.22 | 2.7 | 2.0 | 24.71 |
| | 4 | 17.28 | 6.6 | 61.81 | 0.23 | 0.45 | 63.93 | 8.82 | 2.84 | 67.8 | 4.9 | 3.1 | 36.98 | 2.7 | 2.9 | No |
| | 5 | 17.51 | 4.67 | 73.33 | 0.23 | 0.45 | 75.58 | 8.82 | 2.5 | 71.66 | 4.9 | 2.6 | 46.22 | 2.7 | 2.0 | 24.71 |
| | Average | | | 71.14 | | | 72.60 | | | 71.54 | | | 44.76 | | | 16.21 |
| (3) HAL | 1 | 17.51 | 4.62 | 73.62 | 0.46 | 0.35 | 74.96 | 13.25 | 3.55 | 73.21 | 5.9 | 3.12 | 47.0 | 2.62 | 2.43 | 7.25 |
| | 2 | 26.15 | 6.9 | 73.61 | 0.46 | 0.35 | 74.5 | 13.25 | 3.55 | 73.21 | 5.9 | 3.12 | 47.0 | 2.62 | 2.43 | 7.25 |
| | 3 | 17.74 | 4.78 | 73.05 | 0.46 | 0.9 | 76.97 | 13.25 | 3.73 | 71.85 | 5.9 | 3.17 | 46.2 | 2.62 | 2.23 | 12.55 |
| | 4 | 17.51 | 6.71 | 61.68 | 0.23 | 0.45 | 63.77 | 10.6 | 3.73 | 64.8 | 5.9 | 4.07 | 30.8 | 3.27 | 3.85 | No |
| | 5 | 17.51 | 4.67 | 73.33 | 0.23 | 0.45 | 75.6 | 10.6 | 2.98 | 71.9 | 5.9 | 3.17 | 46.2 | 3.27 | 2.46 | 24.66 |
| | Average | | | 71.06 | | | 73.16 | | | 71.0 | | | 43.44 | | | 10.34 |
| (4) IIR | 1 | 25.92 | 8.88 | 65.74 | 0.23 | 0.12 | 65.9 | 11.03 | 3.5 | 68.36 | 4.9 | 3.05 | 37.7 | 2.18 | 2.04 | 6.57 |
| | 2 | 25.92 | 6.84 | 73.61 | 0.23 | 0.12 | 73.84 | 11.03 | 2.98 | 72.98 | 4.9 | 2.6 | 47.96 | 2.18 | 1.73 | 20.44 |
| | 3 | 17.51 | 4.67 | 73.34 | 0.23 | 0.45 | 75.58 | 8.82 | 2.57 | 70.86 | 4.9 | 2.64 | 46.22 | 2.72 | 2.05 | 24.71 |
| | 4 | 17.51 | 6.71 | 61.68 | 0.23 | 0.45 | 63.77 | 8.82 | 3.32 | 62.86 | 4.9 | 3.54 | 27.73 | 2.72 | 2.75 | No |
| | 5 | 17.51 | 4.67 | 73.33 | 0.23 | 0.45 | 75.58 | 8.82 | 2.5 | 71.66 | 4.9 | 2.64 | 46.22 | 2.72 | 2.05 | 24.71 |
| | Average | | | 69.54 | | | 71.65 | | | 69.34 | | | 41.17 | | | 15.24 |
| (5) ARF | 1 | 8.87 | 2.34 | 73.62 | 0.23 | 0.12 | 74.1 | 4.5 | 1.22 | 72.9 | 5.0 | 2.64 | 47.2 | 5.56 | 4.4 | 20.83 |
| | 2 | 8.87 | 2.34 | 73.62 | 0.23 | 0.12 | 74.1 | 4.5 | 1.22 | 72.9 | 5.0 | 2.64 | 47.2 | 5.56 | 4.4 | 20.83 |
| | 3 | 8.87 | 2.39 | 73.05 | 0.23 | 0.45 | 77.6 | 4.5 | 1.4 | 68.9 | 5.0 | 2.74 | 45.3 | 5.56 | 3.8 | 31.63 |
| | 4 | 8.87 | 2.39 | 73.05 | 0.23 | 0.45 | 77.6 | 4.5 | 1.4 | 68.9 | 5.0 | 2.74 | 45.3 | 5.56 | 3.8 | 31.63 |
| | 5 | 8.87 | 2.39 | 73.05 | 0.23 | 0.45 | 77.6 | 4.5 | 1.4 | 68.9 | 5.0 | 2.74 | 45.3 | 5.56 | 3.8 | 31.63 |
| | Average | | | 73.28 | | | 76.20 | | | 70.5 | | | 46.06 | | | 27.31 |
| Overall average | | | | 71.70 | | | 74.0 | | | 70.82 | | | 44.36 | | | 17.31 |

Table 7.4. Power, Energy and EDP Estimates for Benchmarks using MVMC

| Benchmark | RC | $P_{pS}$ | $P_{pM}$ | $\Delta P_p$ (%) | $P_{mS}$ | $P_{mM}$ | $\Delta DP$ (%) | $P_S$ | $P_M$ | $\Delta P$ (%) | $E_S$ | $E_M$ | $\Delta E$ (%) | $EDP_S$ | $EDP_M$ | $\Delta EDP$ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) EXP | 1 | 17.28 | 13.2 | 23.61 | 0.46 | 0.35 | 23.6 | 8.87 | 6.84 | 22.9 | 3.0 | 2.03 | 31.47 | 0.99 | 0.9 | 8.63 |
| | 2 | 17.28 | 13.7 | 20.83 | 0.46 | 0.35 | 20.8 | 8.87 | 6.96 | 21.53 | 3.0 | 1.57 | 46.8 | 0.99 | 0.7 | 29.07 |
| | 3 | 17.28 | 9.12 | 47.22 | 0.46 | 0.46 | 48.51 | 8.87 | 5.61 | 36.75 | 3.0 | 1.57 | 46.0 | 0.99 | 0.89 | 9.98 |
| | 4 | 8.87 | 13.43 | NA | 0.23 | 0.23 | NA | 6.67 | 6.77 | NA | 3.0 | 2.5 | 16.46 | 1.31 | 1.11 | 15.33 |
| | 5 | 17.28 | 9.35 | 45.9 | 0.23 | 0.23 | 46.51 | 6.65 | 5.61 | 15.64 | 3.0 | 1.6 | 46.0 | 1.31 | 0.89 | 32.5 |
| | Average | | | 27.51 | | | 27.88 | | | 19.36 | | | 37.35 | | | 19.10 |
| (2) FIR | 1 | 17.51 | 17.76 | NA | 0.23 | 0.23 | NA | 8.87 | 7.67 | 13.04 | 4.9 | 3.09 | 37.0 | 2.72 | 2.06 | 24.38 |
| | 2 | 25.92 | 13.68 | 47.22 | 0.23 | 0.12 | 47.21 | 8.82 | 7.66 | 13.15 | 4.9 | 2.59 | 47.2 | 2.72 | 1.72 | 36.64 |
| | 3 | 17.51 | 9.35 | 46.6 | 0.23 | 0.23 | 47.22 | 8.82 | 7.75 | 12.13 | 4.9 | 2.64 | 46.22 | 2.72 | 2.05 | 24.71 |
| | 4 | 17.28 | 13.43 | 22.28 | 0.23 | 0.23 | 22.58 | 8.82 | 7.51 | 14.85 | 4.9 | 4.0 | 18.5 | 2.72 | 2.66 | 2.19 |
| | 5 | 17.51 | 9.35 | 46.6 | 0.23 | 0.23 | 47.22 | 8.82 | 6.65 | 24.6 | 4.9 | 2.64 | 46.22 | 2.72 | 2.05 | 24.71 |
| | Average | | | 32.54 | | | 32.85 | | | 15.55 | | | 39.03 | | | 22.53 |
| (3) HAL | 1 | 17.51 | 17.76 | NA | 0.46 | 0.35 | NA | 13.25 | 9.08 | 31.47 | 5.9 | 4.0 | 31.6 | 2.62 | 2.68 | NA |
| | 2 | 26.15 | 13.8 | 47.23 | 0.46 | 0.35 | 47.64 | 13.25 | 9.24 | 30.26 | 5.9 | 3.2 | 47.0 | 2.62 | 2.08 | 20.61 |
| | 3 | 17.74 | 9.58 | 46.0 | 0.46 | 0.46 | 47.22 | 13.25 | 7.98 | 39.77 | 5.9 | 3.2 | 46.19 | 2.62 | 2.46 | 6.11 |
| | 4 | 17.51 | 13.43 | 23.3 | 0.23 | 0.23 | 23.61 | 10.6 | 9.0 | 15.2 | 5.9 | 5.0 | 15.4 | 3.27 | 3.32 | NA |
| | 5 | 17.51 | 9.35 | 46.6 | 0.23 | 0.23 | 47.22 | 10.6 | 6.41 | 39.53 | 5.9 | 3.17 | 46.18 | 3.27 | 2.82 | 13.76 |
| | Average | | | 32.63 | | | 33.14 | | | 33.14 | | | 37.27 | | | 8.10 |
| (4) IIR | 1 | 25.92 | 17.76 | 31.48 | 0.23 | 0.12 | 31.34 | 11.03 | 8.95 | 18.85 | 4.9 | 4.0 | 19.22 | 2.18 | 2.2 | NA |
| | 2 | 25.92 | 13.8 | 46.76 | 0.23 | 0.12 | 46.75 | 11.03 | 7.68 | 30.37 | 4.9 | 2.6 | 47.2 | 2.18 | 1.72 | 20.81 |
| | 3 | 17.51 | 9.12 | 47.92 | 0.23 | 0.23 | 48.55 | 8.82 | 5.82 | 34.01 | 4.9 | 2.6 | 46.22 | 2.72 | 2.34 | 13.96 |
| | 4 | 17.51 | 13.43 | 23.3 | 0.23 | 0.23 | 23.61 | 8.82 | 7.51 | 14.85 | 4.9 | 3.54 | 27.73 | 2.72 | 2.36 | 13.28 |
| | 5 | 17.51 | 9.12 | 47.92 | 0.23 | 0.23 | 48.55 | 8.82 | 5.82 | 34.01 | 4.9 | 2.64 | 46.22 | 2.72 | 2.34 | 16.23 |
| | Average | | | 39.48 | | | 39.76 | | | 26.42 | | | 37.32 | | | 12.76 |
| (5) ARF | 1 | 8.87 | 9.24 | NA | 0.23 | 0.12 | NA | 4.5 | 3.58 | 20.44 | 5.0 | 2.64 | 47.22 | 5.56 | 3.81 | 31.4 |
| | 2 | 8.87 | 9.24 | NA | 0.23 | 0.12 | NA | 4.5 | 3.58 | 20.44 | 5.0 | 2.64 | 47.22 | 5.56 | 3.81 | 31.4 |
| | 3 | 8.87 | 9.35 | NA | 0.23 | 0.23 | NA | 4.5 | 3.65 | 18.9 | 5.0 | 2.74 | 45.3 | 5.56 | 3.95 | 28.9 |
| | 4 | 8.87 | 13.43 | NA | 0.23 | 0.23 | NA | 4.5 | 3.56 | 20.9 | 5.0 | 3.19 | 36.24 | 5.56 | 4.60 | 17.11 |
| | 5 | 8.87 | 9.35 | NA | 0.23 | 0.23 | NA | 4.5 | 3.65 | 18.9 | 5.0 | 2.74 | 45.3 | 5.56 | 3.95 | 28.9 |
| | Average | | | 0 | | | 0 | | | 19.92 | | | 44.26 | | | 27.54 |
| Overall average | | | | 26.44 | | | 26.73 | | | 22.51 | | | 39.05 | | | 17.99 |

Figure 7.5. Average Reductions in Power or Energy for Benchmarks using CPFMVDFC (four panels: peak power reduction, peak power differential reduction, average power reduction and energy reduction, each in % for the five benchmark circuits)

To study the power consumption per cycle, we plotted the power profile of the different benchmarks over all the control steps (clock steps). Fig. 7.7, 7.8, 7.9, 7.10 and 7.11 show the power profiles of the benchmarks for resource constraints RC1, RC2, RC3, RC4 and RC5, respectively. The curves labeled SF correspond to the profile when the schedule is operated at a single frequency (the maximum frequency of the slowest operator, the multiplier) and a single voltage. The profiles labeled DFC correspond to the case when the dynamic clocking and multiple voltage scheme is used. Similarly, the profiles labeled MC are for the MVMC scheme. The figures show the effectiveness of the proposed scheduling schemes.
Figure 7.6. Average Reductions for Benchmarks using CPFMVMC (four panels: peak power reduction, peak power differential reduction, average power reduction and energy reduction, each in % for the five benchmark circuits)

Figure 7.7. Power Profile for Benchmark for Resource Constraint RC1 (panels: (1) EXP, (2) FIR, (3) HAL, (4) IIR; curves SF, DFC and MC; cycle power $P_c$ versus control step $c$)

7.6 Conclusions

In low power designs for portable applications, the simultaneous minimization of total energy and transient power is essential. The modified CPF parameter defined and used in this work facilitates such simultaneous optimization using ILP formulations. The optimization is performed using the MVDFC scheme and the MVMC scheme. The datapath scheduling algorithm described in this chapter is particularly useful for synthesizing data intensive application specific integrated circuits. The algorithm attempts to optimize energy and power while maintaining performance. The scheduling algorithm takes as resource constraints the number of resources of each type at each voltage level (both CPFMVDFC and CPFMVMC) and the number of allowable frequencies (CPFMVDFC scheme). The energy delay product for both the CPFMVDFC and CPFMVMC scheduling scenarios was estimated to keep track of the effect of the scheduling algorithms on circuit performance. The CPFMVDFC scheduling resulted in a reduction of EDP for all benchmarks and all resource constraints, which shows its effectiveness.
On the other hand, the CPFMVMC scheme improved the EDP in almost all cases; in a few cases there was no improvement. The results clearly indicate that the multiple supply voltage and dynamic frequency clocking scheme yields better power and energy minimization than the multiple supply voltage and multicycling scheme. The effectiveness of the scheduling schemes in the context of pipelined datapaths and control intensive applications needs to be investigated.

Figure 7.8. Power Profile for Benchmark for Resource Constraint RC2 (panels: (1) EXP, (2) FIR, (3) HAL, (4) IIR; curves SF, DFC and MC; cycle power $P_c$ versus control step $c$)

Figure 7.9. Power Profile for Benchmark for Resource Constraint RC3 (panels: (1) EXP, (2) FIR, (3) HAL, (4) IIR; curves SF, DFC and MC; cycle power $P_c$ versus control step $c$)

Figure 7.10. Power Profile for Benchmark for Resource Constraint RC4 (panels: (1) EXP, (2) FIR, (3) HAL, (4) IIR; curves SF, DFC and MC; cycle power $P_c$ versus control step $c$)

Figure 7.11. Power Profile for Benchmark for Resource Constraint RC5 (panels: (1) EXP, (2) FIR, (3) HAL, (4) IIR; curves SF, DFC and MC; cycle power $P_c$ versus control step $c$)

CHAPTER 8 POWER FLUCTUATION MINIMIZATION

In this chapter, we describe a new datapath scheduling scheme for the reduction of cycle power fluctuation at the behavioral level using integer linear programming (ILP) based models [175]. We develop a power model that captures the cycle power fluctuation as a cycle-to-cycle power gradient using switching activity, supply voltages and operating frequency. Then, we provide ILP based models for its minimization assuming three modes of circuit operation: (1) single supply voltage and single operating frequency (SVSF), (2) multiple supply voltages and dynamic frequency (MVDFC) and (3) multiple supply voltages and multicycling (MVMC). The effectiveness of our scheduling technique is measured by estimating the mean power gradient, the peak power consumption ($P_p$), the average power consumption ($P_a$) and the power delay product ($PDP$) of the scheduled data flow graph. We compare the MVDFC and MVMC based scheduling algorithms with the results of the SVSF based scheduling algorithm. In the case of the multiple supply voltage schemes, the power consumption of the level converters is taken into account. Similarly, in the case of dynamic frequency clocking, the overhead due to the dynamic clocking unit is considered. The dynamic frequency clocking methodology is more effective for data intensive signal processing applications. The proposed scheduling algorithms are resource constrained.
For the SVSF scheme, the resource constraint is the number of functional units. On the other hand, both the MVDFC and MVMC scheduling schemes use the number and type of functional units at different operating voltages as the resource constraints. In addition, the MVDFC scheme uses a certain number of allowable frequencies as a resource constraint.

8.1 Power Fluctuation Modeling

In this section, we discuss different power terminologies with reference to a datapath circuit. Let us assume that the datapath is represented in the form of a sequencing data flow graph. The datapath uses various functional units operating at different supply voltages. A level converter is considered a resource operating in the control step in which it needs to step up a signal. The dynamic clocking unit (DCU) that generates the dynamic frequency is considered a resource operating in all the control steps. Our aim is to develop power models using generic terms such as switching activity, supply voltages and operating frequencies. The intention of using such parameters is to make the power model a general one, independent of any specific energy or power models. It can accommodate both lookup table based energy (power) models and energy (power) macromodels. The generic model also eases integration of the proposed power model into a behavioral synthesis tool that uses both a behavioral power estimator and a datapath scheduler. Moreover, the generic model can be easily tuned to handle any of the three modes of datapath circuit operation: (i) single supply voltage and single frequency (SVSF), (ii) multiple supply voltages and dynamic frequency (MVDFC), and (iii) multiple supply voltages and multicycling (MVMC). For the MV schemes, the datapath uses functional units operating at different supply voltages.
In this mode, a level converter is considered a resource operating in the control step in which it needs to step up a signal.

Let $x_1, x_2, \ldots, x_n$ be a set of $n$ observations from a given distribution. The sample mean (which is an unbiased estimator for the population mean, $\mu$) is $\frac{1}{n}\sum_{i=1}^{n} x_i$. The observation-to-observation gradient can be defined as $|x_i - x_{i-1}|$, where $i \ge 2$. The mean gradient is given by $\frac{1}{n-1}\sum_{i=2}^{n} |x_i - x_{i-1}|$. Note that there are $n-1$ gradients for $n$ observations.

The notations used in the description are given in Table 8.1. For the single frequency and single supply voltage mode of operation, $V_{i,c}$ and $f_c$ are the same for any clock cycle ($c$) and resource ($i$). Similarly, for multicycling operation, $f_c$ is the same for any clock cycle ($c$). The power consumption for any control step $c$ is given by Eqn. 8.1. This is the total power consumption of all functional units active in control step $c$. It also includes the power consumption of the level converters, where a level converter is considered a resource operating in cycle $c$ if the current resource is driven by a resource operating at a lower voltage.

Table 8.1. Notations used in the Description

$N$ : total number of control steps in the DFG
$O$ : total number of operations in the DFG
$c$ : a control step or a clock cycle in the DFG ($1 \le c \le N$)
$o_i$ : any operation $i$, $1 \le i \le O$
$P_c$ : the total power consumption of all functional units active in control step $c$ (cycle power consumption)
$P_p$ : peak power consumption for the DFG, equal to $\max(P_c)$ over all $c$
$P_a$ : mean power consumption of the DFG (average of $P_c$)
$PG_c$ : power gradient for cycle $c$ (where $2 \le c \le N$)
$PG_p$ : peak power gradient of the DFG, equal to $\max(PG_c)$ over all $c$
$MPG$ : mean power gradient of the DFG over $2 \le c \le N$
$FU_{k,v}$ : any functional unit of type $k$ operating at voltage level $v$
$FU_i$ : any $FU_{k,v}$ needed by $o_i$ for its execution ($o_i \in FU_{k,v}$)
$FU_{i,c}$ : any functional unit $FU_i$ active in control step $c$
$R_c$ : total number of functional units active in step $c$ (same as the number of operations scheduled in $c$)
$\alpha_{i,c}$ : switching activity of resource $FU_{i,c}$
$V_{i,c}$ : operating voltage of resource $FU_{i,c}$
$C_{i,c}$ : load capacitance of resource $FU_{i,c}$
$f_c$ : frequency of control step $c$

$$P_c = \sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \tag{8.1}$$

The peak power consumption of the DFG is the maximum power consumption over all the control steps, which can be expressed as below.

$$P_p = \max_{c=1,\ldots,N}\left(P_c\right) = \max_{c=1,\ldots,N}\left(\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c\right) \tag{8.2}$$

The mean cycle power consumption of the DFG ($P_a$) is defined as,

$$P_a = \frac{1}{N}\sum_{c=1}^{N} P_c = \frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c \tag{8.3}$$

The mean cycle power $P_a$ is an unbiased estimate of the average power consumption of the DFG. The true average power consumption of the DFG is the total energy consumption of the DFG per clock cycle or per second. The power gradient $PG_c$ for any control step $c$ is defined as the absolute difference of its power consumption from that of the previous control step, as given below.

$$PG_c = \left|P_c - P_{c-1}\right| = \left|\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c - \sum_{i=1}^{R_{c-1}} \alpha_{i,c-1}\, C_{i,c-1}\, V_{i,c-1}^2\, f_{c-1}\right|, \quad c = 2, \ldots, N \tag{8.4}$$

The peak of the power gradients is denoted as $PG_p$:

$$PG_p = \max_{c=2,\ldots,N}\left|P_c - P_{c-1}\right| = \max_{c=2,\ldots,N}\left|\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c - \sum_{i=1}^{R_{c-1}} \alpha_{i,c-1}\, C_{i,c-1}\, V_{i,c-1}^2\, f_{c-1}\right| \tag{8.5}$$

The mean power gradient $MPG$ is calculated as,

$$MPG = \frac{1}{N-1}\sum_{c=2}^{N} PG_c = \frac{1}{N-1}\sum_{c=2}^{N}\left|P_c - P_{c-1}\right| = \frac{1}{N-1}\sum_{c=2}^{N}\left|\sum_{i=1}^{R_c} \alpha_{i,c}\, C_{i,c}\, V_{i,c}^2\, f_c - \sum_{i=1}^{R_{c-1}} \alpha_{i,c-1}\, C_{i,c-1}\, V_{i,c-1}^2\, f_{c-1}\right| \tag{8.6}$$

The above generic power models are independent of any specific energy or power models. Using the dynamic energy model proposed in [51], we can express the effective switching capacitance of our proposed model as,

$$\alpha_i\, C_i = C_{swi}\left(\alpha_i^1, \alpha_i^2\right) \tag{8.7}$$

Here, $\alpha_i$ and $C_i$ are the parameters corresponding to the functional unit $FU_i$ as defined before. $C_{swi}$ is a measure of the effective switching capacitance of functional unit $FU_i$, which is a function of $\alpha_i^1$ and $\alpha_i^2$; $\alpha_i^1$ and $\alpha_i^2$ are the average switching activities on the first and second input operands of $FU_i$. Any other power or energy model can be incorporated similarly. It should be noted that the above switching model (Eqn. 8.7) handles input pattern dependencies. Using Eqn. 8.7, we can rewrite Eqn. 8.6 as follows.

$$MPG = \frac{1}{N-1}\sum_{c=2}^{N}\left|\sum_{i=1}^{R_c} C_{swi,c}\, V_{i,c}^2\, f_c - \sum_{i=1}^{R_{c-1}} C_{swi,c-1}\, V_{i,c-1}^2\, f_{c-1}\right| \tag{8.8}$$

We use the above $MPG$ as the objective function for low power datapath scheduling. We make the following observations about $MPG$. It is a nonlinear function because of the absolute value function. It is a function of parameters such as switching activity, capacitance, operating voltage and operating frequency. We will use ILP formulations to minimize $MPG$ through datapath scheduling for three modes of datapath operation, namely SVSF, MVDFC and MVMC, as described before. The critical path delay of the DFG can be calculated as,

$$T = \sum_{c=1}^{N} \frac{1}{f_c} \tag{8.9}$$

It should be noted that $f_c$ is the same for all values of $c$ for single frequency and multicycling operations, and may differ across control steps for dynamic frequency clocking operations.
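The quantities in Eqn. 8.1 to 8.6 and Eqn. 8.9 can be evaluated directly from a schedule. The following is a minimal sketch, assuming each control step is given as a list of (switching activity, load capacitance, voltage) triples for its active functional units plus the step frequency; the function names and the sample numbers are illustrative only.

```python
def cycle_power(units, frequency):
    """Eqn. 8.1: sum of alpha * C * V^2 * f over the units active in one step."""
    return sum(alpha * cap * volt ** 2 * frequency for alpha, cap, volt in units)


def power_profile(schedule):
    """Cycle power of every control step; schedule is a list of (units, f_c)."""
    return [cycle_power(units, freq) for units, freq in schedule]


def peak_power(profile):
    """Eqn. 8.2: maximum cycle power."""
    return max(profile)


def mean_power(profile):
    """Eqn. 8.3: mean cycle power."""
    return sum(profile) / len(profile)


def power_gradients(profile):
    """Eqn. 8.4: absolute cycle-to-cycle differences, one per step c >= 2."""
    return [abs(cur - prev) for prev, cur in zip(profile, profile[1:])]


def mean_power_gradient(profile):
    """Eqn. 8.6: mean of the N - 1 power gradients."""
    grads = power_gradients(profile)
    return sum(grads) / len(grads)


def critical_path_delay(schedule):
    """Eqn. 8.9: sum of the clock periods of all control steps."""
    return sum(1.0 / freq for _, freq in schedule)


# Hypothetical three-step schedule in arbitrary units: (alpha, C, V) per unit.
example_schedule = [
    ([(0.5, 2.0, 2.0)], 1.0),
    ([(0.5, 2.0, 2.0), (0.25, 4.0, 1.0)], 0.5),
    ([(0.25, 4.0, 2.0)], 1.0),
]
example_profile = power_profile(example_schedule)
```

The peak gradient of Eqn. 8.5 is simply `max(power_gradients(profile))`; it is left out of the helpers only to keep the sketch short.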
The power delay product of the DFG is defined as the product of the average power consumption and the critical path delay, as shown below.

$$PDP = P_a \times T \tag{8.10}$$

Using Eqn. 8.3, 8.7, and 8.9, we have the following expression for the power delay product.

$$PDP = \left(\frac{1}{N}\sum_{c=1}^{N}\sum_{i=1}^{R_c} C_{swi,c}\, V_{i,c}^2\, f_c\right) \times \left(\sum_{c=1}^{N} \frac{1}{f_c}\right) \tag{8.11}$$

To study the impact of the scheduling algorithms on the performance of the datapath, we estimate the power delay product of the scheduled DFGs using the above expression.

8.2 Modeling of Nonlinearities

It is clear from Eqn. 8.8 that $MPG$ is a nonlinear function. The nonlinearity is due to the presence of the absolute value function ($|\cdot|$). The ILP formulations have to handle this form of nonlinearity. In this section, we address the transformations that help in linear modeling of the nonlinear function. The general form of the linear program can be represented as [173, 174]:

$$\text{Minimize: } \sum_i |y_i| \quad \text{Subject to: } y_i + \sum_j x_j\, a_{ij} \le b_i,\ \forall i; \quad x_j \ge 0,\ \forall j \tag{8.12}$$

where $|y_i|$ is the deviation between the prediction and the observation. The $y_i$ [...]

The problems in Eqn. 8.12 and Eqn. 8.16 are equivalent, and minimization of Eqn. 8.16 will result in minimization of Eqn. 8.12.

8.3 ILP Formulations to Minimize Mean Power Gradient

In this section, we discuss the ILP models for minimization of $MPG$ for the various modes of datapath operation, namely SVSF, MVDFC and MVMC. Note that different decision variables are used for the three modes. We first discuss the formulations using MVDFC, followed by MVMC. The formulation for SVSF is not presented since it is trivial. The notations used in the ILP formulations are given in Table 8.2.

Table 8.2. Notations used in ILP formulations

$M_{k,v}$ : maximum number of functional units $FU_{k,v}$
$S_i$ : as soon as possible (ASAP) time stamp for the operation $o_i$
$E_i$ : as late as possible (ALAP) time stamp for the operation $o_i$
$P(C_{swi}, v, f)$ : power consumption of the functional unit $FU_{k,v}$ used by $o_i$ for its execution, at voltage $v$ and frequency $f$
$x_{i,c,v,f}$ : decision variable which takes the value 1 if operation $o_i$ is scheduled in control step $c$ using the functional unit $FU_{k,v}$ and $c$ has frequency $f_c$
$y_{i,v,l,m}$ : decision variable which takes the value 1 if operation $o_i$ is using any $FU_{k,v}$ and is scheduled in control steps $l \rightarrow m$
$L_{i,v}$ : latency for operation $o_i$ using a resource operating at voltage $v$ (in terms of number of clock cycles)

8.3.1 Formulations using Multiple Voltages and Dynamic Frequency

In dynamic frequency clocking [59, 62], the clock frequency is varied on the fly based on the functional units active in that cycle. In this clocking scheme, all the units are clocked by a single clock line which switches at runtime. The frequency reduction creates an opportunity to operate the different functional units at different voltages, which in turn helps in further reduction of power.

Objective Function: The objective is to minimize the mean power gradient $MPG$ described in Eqn. 8.8 for the whole DFG over all control steps.

$$\text{Minimize: } MPG \tag{8.17}$$

Using Eqn. 8.6, this can be restated as:

$$\text{Minimize: } \frac{1}{N-1}\sum_{c=2}^{N}\left|P_c - P_{c-1}\right| \tag{8.18}$$

This problem is nonlinear because of the absolute value function. It can be converted to an equivalent problem using the transformation suggested in the previous section.

$$\text{Minimize: } \frac{1}{N-1}\sum_{c=2}^{N}\left(P_c + P_{c-1}\right) \quad \text{Subject to: Power gradient constraints} \tag{8.19}$$

The above problem in Eqn. 8.19 is simplified to:

$$\text{Minimize: } \frac{1}{N-1}\left(2\sum_{c=2}^{N-1} P_c + P_1 + P_N\right) \quad \text{Subject to: Power gradient constraints} \tag{8.20}$$

Using the decision variables, the above LP objective function is formulated as,

$$\text{Minimize: } \frac{1}{N-1}\left(2\sum_{c=2}^{N-1}\sum_{i}\sum_{v,f} x_{i,c,v,f}\, P(C_{swi}, v, f) + \sum_{i}\sum_{v,f} x_{i,1,v,f}\, P(C_{swi}, v, f) + \sum_{i}\sum_{v,f} x_{i,N,v,f}\, P(C_{swi}, v, f)\right) \quad \text{Subject to: Power gradient constraints} \tag{8.21}$$

Uniqueness Constraints: These constraints ensure that every operation $o_i$ is scheduled to one unique control step within its mobility range ($S_i$, $E_i$) with a particular supply voltage and operating frequency. We represent them as,

$$\forall i,\ 1 \le i \le O: \quad \sum_{c=S_i}^{E_i}\sum_{v}\sum_{f} x_{i,c,v,f} = 1 \tag{8.22}$$

Precedence Constraints: These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in an earlier control step and its successors are scheduled in a later control step. These are modelled as, [...]

Objective Function: The objective is to minimize the mean power gradient $MPG$ described in Eqn. 8.8 for the whole DFG over all control steps.

$$\text{Minimize: } MPG \tag{8.26}$$

Using Eqn. 8.6, this can be restated as:

$$\text{Minimize: } \frac{1}{N-1}\sum_{c=2}^{N}\left|P_c - P_{c-1}\right| \tag{8.27}$$

This problem is nonlinear because of the absolute value function. It can be converted to an equivalent problem using the transformation suggested in the previous section.

$$\text{Minimize: } \frac{1}{N-1}\sum_{c=2}^{N}\left(P_c + P_{c-1}\right) \quad \text{Subject to: Power gradient constraints} \tag{8.28}$$

Following steps similar to those in the previous section (Section 8.3.1) and using the transformations, the problem in Eqn. 8.28 is simplified and the objective function is redefined as:
$$\text{Minimize: } \frac{1}{N-1}\left(2\sum_{c=2}^{N-1} P_c + P_1 + P_N\right) \quad \text{Subject to: Power gradient constraints} \tag{8.29}$$

Then, using the decision variables, the objective function is formulated as,

$$\text{Minimize: } \frac{1}{N-1}\left(2\sum_{c=2}^{N-1}\sum_{i}\sum_{v} y_{i,v,c,(c+L_{i,v}-1)}\, P(C_{swi}, v, f_{clk}) + \sum_{i}\sum_{v} y_{i,v,1,(1+L_{i,v}-1)}\, P(C_{swi}, v, f_{clk}) + \sum_{i}\sum_{v} y_{i,v,N,(N+L_{i,v}-1)}\, P(C_{swi}, v, f_{clk})\right) \quad \text{Subject to: Power gradient constraints} \tag{8.30}$$

Uniqueness Constraints: These constraints ensure that every operation $o_i$ is scheduled in appropriate control steps within its mobility range ($S_i$, $E_i$) with a particular supply voltage. Depending on the supply voltage, it may be operated over more than one clock cycle. We represent them as a sum over the admissible start steps $l$ within the mobility range,

$$\forall i,\ 1 \le i \le O: \quad \sum_{v}\sum_{l} y_{i,v,l,(l+L_{i,v}-1)} = 1 \tag{8.31}$$

When the operators are operating at the highest voltage, they are scheduled in one unique control step, whereas when they are operated at lower voltages they need more than one clock cycle for completion. Thus, for lower voltages the mobility is restricted.

Precedence Constraints: These constraints guarantee that for an operation $o_i$, all its predecessors are scheduled in an earlier control step and its successors are scheduled in a later control step. These constraints should also take care of the multicycling operations. These are modelled as,

$$\forall i, j,\ o_i \in Pred_{o_j}: \quad \sum_{v}\sum_{l}\left(l + L_{i,v} - 1\right) y_{i,v,l,(l+L_{i,v}-1)} - \sum_{v}\sum_{l} l\; y_{j,v,l,(l+L_{j,v}-1)} \le -1 \tag{8.32}$$

Resource Constraints: These constraints make sure that no control step contains more than $M_{k,v}$ operations of type $k$ operating at voltage $v$. These can be enforced as,

$$\forall v \text{ and } \forall c,\ 1 \le c \le N: \quad \sum_{i \in FU_{k,v}}\sum_{l} y_{i,v,l,(l+L_{i,v}-1)} \le M_{k,v} \tag{8.33}$$

Power Gradient Constraints: These constraints are introduced to eliminate the absolute value nonlinearity of the objective function. These constraints can be enforced as,

$$\forall c,\ 2 \le c \le N: \quad \sum_{i}\sum_{v} y_{i,v,c,(c+L_{i,v}-1)}\, P(C_{swi}, v, f_{clk}) - \sum_{i}\sum_{v} y_{i,v,c-1,(c+L_{i,v}-2)}\, P(C_{swi}, v, f_{clk}) \le PG_p \tag{8.34}$$

where $PG_p$ is the power gradient bound, which is added to the objective function and minimized along with it.
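To make the roles of the mobility range, precedence constraints and resource constraints concrete, the sketch below searches a very small design space exhaustively for the schedule with the lowest mean power gradient. This is only a stand-in for the ILP formulations, not the solver flow used in this work: it assumes a single voltage, unit latency and a fixed power per operation, and every name and number in it is hypothetical.

```python
from itertools import product


def cycle_powers(assignment, op_power, num_steps):
    """Cycle power profile for a given operation -> control step assignment."""
    profile = [0.0] * num_steps
    for op, step in assignment.items():
        profile[step - 1] += op_power[op]
    return profile


def mean_power_gradient(profile):
    """Mean absolute cycle-to-cycle power difference (Eqn. 8.6)."""
    diffs = [abs(cur - prev) for prev, cur in zip(profile, profile[1:])]
    return sum(diffs) / len(diffs)


def feasible(assignment, edges, op_type, limits, num_steps):
    """Check precedence and per-step resource constraints."""
    if any(assignment[src] >= assignment[dst] for src, dst in edges):
        return False
    for step in range(1, num_steps + 1):
        used = {}
        for op, scheduled in assignment.items():
            if scheduled == step:
                used[op_type[op]] = used.get(op_type[op], 0) + 1
        if any(count > limits[kind] for kind, count in used.items()):
            return False
    return True


def best_schedule(mobility, edges, op_type, op_power, limits, num_steps):
    """Enumerate every step choice inside the mobility ranges; keep the best."""
    ops = sorted(mobility)
    ranges = [range(mobility[op][0], mobility[op][1] + 1) for op in ops]
    best = None
    for steps in product(*ranges):
        assignment = dict(zip(ops, steps))
        if not feasible(assignment, edges, op_type, limits, num_steps):
            continue
        mpg = mean_power_gradient(cycle_powers(assignment, op_power, num_steps))
        if best is None or mpg < best[0]:
            best = (mpg, assignment)
    return best


# Hypothetical DFG: two multiplications, each feeding one addition.
mobility = {"m1": (1, 2), "m2": (1, 2), "a1": (2, 3), "a2": (2, 3)}
edges = [("m1", "a1"), ("m2", "a2")]
op_type = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}
op_power = {"m1": 4.0, "m2": 4.0, "a1": 1.0, "a2": 1.0}
limits = {"mul": 1, "add": 2}
result = best_schedule(mobility, edges, op_type, op_power, limits, num_steps=3)
```

Exhaustive search grows exponentially with the number of operations, which is one reason a formulation that can be handed to an ILP solver is preferable for realistic benchmarks.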
8.4 Scheduling Algorithm

In this section, we discuss the solutions for the ILP formulations obtained in the previous section and develop scheduling algorithms for both the MVDFC and MVMC schemes. The target architecture model assumed by the scheduling schemes is the same as the one used in [65]. Each functional unit has a register and a multiplexor, and each functional unit feeds a single register. The register and the multiplexor operate at the same voltage level as the functional unit. Level converters are used when a low-voltage functional unit drives a high-voltage functional unit [65, 95]. A controller decides which of the functional units are active in each control step, and those that are not active are disabled using the multiplexors. For the MVDFC scheme, the controller has a storage unit to store the cycle frequency index ($cfi_c$) obtained from the scheduling, which serves as the clock dividing factor for the dynamic clocking unit. The cycle frequency $f_c$ is generated dynamically and a functional unit operating at one of the supply voltages is activated.

The inputs to the algorithm are an unscheduled data flow graph (UDFG), the resource constraints, the number of allowable voltage levels, the number of allowable frequencies, and the delays of each resource ($d_{FU}$), multiplexor ($d_{Mux}$) and register ($d_{Reg}$) at the different voltage levels. The delays of the level converters ($d_{Conv}$) are represented in the form of a matrix that gives the delay in converting from one voltage level $V_i$ to another voltage level $V_j$ (where both $V_i$ and $V_j$ belong to the set of allowable voltage levels). The resource constraints include the number of ALUs and multipliers at the different voltage levels. The scheduling algorithm determines the proper time stamp for each operation, the base frequency $f_{base}$, the cycle frequency index $cfi_c$ (using [48]) and the voltage level such that the function $MPG$ (Eqn. 8.8) is minimum.
The ILP based scheduler, which minimizes the mean power gradient of the DFG, is outlined in Fig. 8.1. In step 1, the scheduler constructs a lookup table of effective switching capacitance for known values of the average switching activity pair, as described in Eqn. 8.7. In step 2, the scheduler determines the switching activities at the inputs of each node by behavioral simulation of the DFG. For this purpose, different sets of application specific input vectors (having different correlations) are applied at the primary inputs of the DFG and the average switching activity at each input of the other nodes is calculated [167, 169]. If the lookup table (from step 1) does not have the switching capacitance for a pair of input average switching activities (from step 2), the scheduler uses interpolation to find it.

Input : DFG, Constraints, Voltage and Freq. Levels, Delays
Output : Scheduled DFG, $f_{base}$, $N$, $cfi_c$, Power estimates
Step 1 : Construct effective switching capacitance lookup table.
Step 2 : Calculate the switching activities for each node.
Step 3 : Find ASAP and ALAP schedule of the UDFG.
Step 4 : Determine the mobility graphs for different schemes.
Step 5 : Calculate operating frequency of FUs using delays.
Step 6 : Model the ILP formulations of DFG using AMPL.
Step 7 : Solve the ILP formulations using LPSolve.
Step 8 : Obtain the scheduled DFG.
Step 9 : Determine $f_c$, $f_{base}$ and $cfi_c$ for MVDFC scheme.
Step 10 : Estimate the power and delay of the scheduled DFG.

Figure 8.1. Scheduling for $MPG$ Minimization

The third step is to determine the as soon as possible (ASAP) time stamp of each operation. The fourth step is the determination of the as late as possible (ALAP) time stamp of each vertex of the DFG. The ASAP time stamp is the start time and the ALAP time stamp is the finish time of each operation.
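The ASAP and ALAP time stamps described above follow from a longest-path computation on the DFG. The sketch below assumes unit latency for every operation and an acyclic graph; the helper names and the four-operation example graph are hypothetical and are not the DFG of Fig. 8.2.

```python
def asap_times(ops, edges):
    """Earliest control step of each operation (steps are numbered from 1)."""
    preds = {op: [] for op in ops}
    for src, dst in edges:
        preds[dst].append(src)
    asap = {}

    def visit(op):
        if op not in asap:
            asap[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return asap[op]

    for op in ops:
        visit(op)
    return asap


def alap_times(ops, edges, num_steps):
    """Latest control step of each operation for a schedule of num_steps steps."""
    succs = {op: [] for op in ops}
    for src, dst in edges:
        succs[src].append(dst)
    alap = {}

    def visit(op):
        if op not in alap:
            latest = min((visit(s) for s in succs[op]), default=num_steps + 1)
            alap[op] = latest - 1
        return alap[op]

    for op in ops:
        visit(op)
    return alap


def mobility_ranges(ops, edges, num_steps):
    """(ASAP, ALAP) pair per operation: the steps in which it may be scheduled."""
    asap = asap_times(ops, edges)
    alap = alap_times(ops, edges, num_steps)
    return {op: (asap[op], alap[op]) for op in ops}


# Hypothetical DFG: two multiplications feed an addition, which feeds another.
example_ops = ["m1", "m2", "a1", "a2"]
example_edges = [("m1", "a1"), ("m2", "a1"), ("a1", "a2")]
example_mobility = mobility_ranges(example_ops, example_edges, num_steps=4)
```

For multicycle operation the same idea applies, but each operation's latency at the chosen voltage has to replace the unit latency assumed here, which is why the mobility graph differs for the MVMC scheme.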
These tw o time stamps pro vide the mobility of a operation and the operation must be scheduled in this mobile range. This mobility graph needs to be modied for the MVMC scheme. Then the scheduler nds the ILP formulations based on the models described before. The scheduler uses modeling language AMPL to model the ILP formulations [166 ]. At this step, we calculate the po wer consumption of the functional units as follo ws. The operational delay of a functional unit is assumed as (6DGFO"6 O A7O6 : O"6 5 F ). F or the MVMC scheme the operating frequenc y is the frequenc y corresponding to operational delay at the highest operating v oltage of multiplier unit. On the other hand, for MVDFC scheme operating frequenc y of a functional unit is assumed to be the in v erse of operational delay of a functional unit at corresponding supply v oltage. W e get the switching capacitance from step 1 and step 2, and for dif ferent operating v oltages and frequencies the po wer v alues are calculated whene v er necessary After the ILP formulation is solv ed using LPSolv e the scheduled DFG is obtained. Then, the scheduler determines the c ycle frequencies for MVDFC scheme using the methods proposed in [48 ]. Finally po wer consumptions, ener gy consumptions and ener gy delay product of the scheduled DFG is calculated. 205 PAGE 223 0 2 5 6 7 4 Source Sink * + + + NOP NOP 3 c0 c1 c2 c3 c4 1 (a) ASAP Schedule 0 1 2 3 4 5 6 NOP NOP 7 * + + + Source Sink (b) ALAP Schedule 1 2 4 3 5 6 * + + + (c) Mobility for MVDFC * + + + 1 2 3 4 5 6 c1 c2 c3 c4 c0 (d) Mobility for MVMC Figure 8.2. Example Data Flo w Graph (DFG) 206 PAGE 224 W e illustrate the solution for the ILP formulations with the help of the DFG sho wn in Fig. 8.2. The ASAP schedule is sho wn in Fig. 8.2(a) and the ALAP schedule is sho wn in Fig. 8.2(b). From the ASAP and ALAP scheduling we obtained the mobility graphs sho wn in Fig. 8.2(c) and Fig. 
8.2(d) for the MVDFC and MVMC schemes, respectively. Using these mobility graphs, we get the ILP formulations. We solved the formulations using LPSolve and, based on the results, obtained the scheduled DFG. In the MVMC case, the mobility graph accounts for the multicycle operations. In this illustration, we assume two operating voltage levels, and the multipliers take two clock cycles when they are operated at the lower voltage. It should be noted that the mobility graph depends on the number of operating voltages and the assumed operating frequency.

8.5 Experimental Results

In this section, we discuss the experiments conducted for the scheduling schemes proposed in the previous sections. The ILP-based schedulers for all three schemes (SVSF, MVDFC and MVMC) are tested with five benchmark circuits:
• Example circuit (EXP) (8 nodes, 3*, 3+, 9 edges)
• FIR filter (11 nodes, 5*, 4+, 19 edges)
• IIR filter (11 nodes, 5*, 4+, 19 edges)
• HAL differential equation solver (13 nodes, 6*, 2+, 2-, 1<, 16 edges)
• Auto-Regressive filter (ARF) (15 nodes, 5*, 8+, 19 edges)

The notations used to express the results are given in Table 8.3.

Table 8.3. Notations used in Describing the Results
$MPG_S$ : the mean power gradient for SVSF operation
$MPG_D$ : the mean power gradient for MVDFC operation
$MPG_M$ : the mean power gradient for MVMC operation
$P_{pS}$ : the peak power consumption for SVSF operation
$P_{pD}$ : the peak power consumption for MVDFC operation
$P_{pM}$ : the peak power consumption for MVMC operation
$P_{aS}$ : the average power consumption for SVSF operation
$P_{aD}$ : the average power consumption for MVDFC operation
$P_{aM}$ : the average power consumption for MVMC operation
$T_S$ : the critical path delay for SVSF operation
$T_D$ : the critical path delay for MVDFC operation
$T_M$ : the critical path delay for MVMC operation
$PDP_S$ : the power delay product for SVSF operation ($P_{aS} \times T_S$)
$PDP_D$ : the power delay product for MVDFC operation ($P_{aD} \times T_D$)
$PDP_M$ : the power delay product for MVMC operation ($P_{aM} \times T_M$)
$\Delta P_{pD}$ : percentage peak power reduction for MVDFC operation ($\frac{P_{pS} - P_{pD}}{P_{pS}} \times 100$)
$\Delta P_{pM}$ : percentage peak power reduction for MVMC operation ($\frac{P_{pS} - P_{pM}}{P_{pS}} \times 100$)
$\Delta PDP_D$ : percentage PDP reduction for MVDFC operation ($\frac{PDP_S - PDP_D}{PDP_S} \times 100$)
$\Delta PDP_M$ : percentage PDP reduction for MVMC operation ($\frac{PDP_S - PDP_M}{PDP_S} \times 100$)

We use the lookup table method for average switching capacitance calculation. The lookup table construction consists of two phases: input pattern generation and cell characterization. We generate primary input signals of different correlations using the autoregressive moving average (ARMA) model [169]. We characterize the physical implementations of the library modules available in [55] by applying the input patterns generated above for known values of the average switching activity pairs. Whenever necessary, we used interpolation to find the average switching capacitance for any other pair of values that does not exist in the lookup table. It should be noted that the larger the lookup table, the better the accuracy. Our lookup table has 100 pairs of entries. The signals generated above are propagated through the different operators in the DFG and the average switching activities are calculated as described in [169].
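The interpolation step can be illustrated with a small sketch. The text does not say which interpolation method was used, so bilinear interpolation over a regular grid of switching activity values is an assumption here, as are the names `interpolate_capacitance`, `grid` and `table`. A 10 x 10 grid would give the 100 entries mentioned above.

```python
import bisect


def interpolate_capacitance(table, grid, a1, a2):
    """Bilinear interpolation of switching capacitance at the activity pair (a1, a2).

    table : dict mapping (x, y) grid points to characterized capacitance values
    grid  : sorted list of switching activity values used on both axes (assumed)
    """
    def bracket(a):
        # Index of the upper grid neighbor, clamped so that a pair always exists.
        hi = min(max(bisect.bisect_right(grid, a), 1), len(grid) - 1)
        return grid[hi - 1], grid[hi]

    x0, x1 = bracket(a1)
    y0, y1 = bracket(a2)
    tx = (a1 - x0) / (x1 - x0)
    ty = (a2 - y0) / (y1 - y0)

    # Interpolate along the first activity, then along the second.
    low = table[(x0, y0)] * (1 - tx) + table[(x1, y0)] * tx
    high = table[(x0, y1)] * (1 - tx) + table[(x1, y1)] * tx
    return low * (1 - ty) + high * ty


# Hypothetical 3 x 3 table filled with a made-up linear model c = 2x + 3y.
grid = [0.0, 0.5, 1.0]
table = {(x, y): 2 * x + 3 * y for x in grid for y in grid}
value = interpolate_capacitance(table, grid, 0.25, 0.75)
```

Bilinear interpolation reproduces a linear model exactly, which makes the toy table convenient for checking the routine; a characterized table would of course hold measured values instead.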
The schedulers were tested using different sets of resource constraints (RC1, RC2, RC3, RC4, RC5), shown below.
RC1 : multipliers (at A9 and at Z Z 9) and ALUs (at A9 and at Z Z 9)
RC2 : multipliers (Z at A9) and ALUs (at A9 and at Z Z 9)
RC3 : multipliers (at A9) and ALUs (at Z Z 9)
RC4 : multipliers (at A9 and at Z Z 9) and ALUs (at Z Z 9)
RC5 : multipliers (at A9) and ALUs (at Z Z 9)

Table 8.4. Power Estimates for Benchmarks
(MPG columns are estimates; all "Red." columns are percentage reductions with respect to SVSF)

          MPG     MPG     MPG Red.  MPG     MPG Red.  Peak Red.  Peak Red.  Avg Red.  Avg Red.  PDP Red.  PDP Red.
Circuit   SVSF    MVDFC   MVDFC     MVMC    MVMC      MVDFC      MVMC       MVDFC     MVMC      MVDFC     MVMC
EXP       8.42    2.11    74.94     5.96    29.22     73.61      0          72.80     22.91     54.58     0
          8.42    2.11    74.94     5.97    29.10     73.61      20.83      72.80     21.56     54.58     0
          8.42    2.06    75.53     2.17    74.23     73.61      47.22      72.12     36.68     53.56     0
FIR       4.26    1.11    73.94     3.53    17.14     73.61      0          73.47     15.65     52.24     0
          6.42    1.72    73.21     4.54    29.28     73.61      47.22      73.47     12.93     52.24     0
          4.26    1.08    74.65     3.00    29.58     73.61      45.90      72.9      24.72     51.22     0
IIR       8.56    2.92    65.89     4.41    48.48     65.74      31.48      68.33     18.78     52.24     0
          8.56    2.24    73.83     2.71    68.34     73.61      47.22      72.96     30.13     59.60     0
          4.26    1.08    74.65     1.27    70.19     73.61      47.22      72.34     34.13     55.71     0
HAL       8.49    2.85    66.43     3.53    58.42     65.74      31.48      69.26     32.55     46.09     0
          8.56    2.19    74.42     4.52    47.20     73.60      47.20      73.18     30.14     53.06     0
          4.26    1.06    75.12     1.63    61.74     73.33      45.35      72.71     24.64     50.85     0
ARF       5.66    1.46    74.20     2.92    48.41     73.59      0          74.00     22.00     59.40     0
          5.66    1.46    74.20     3.00    47.00     73.59      0          74.00     20.44     59.40     0
          5.66    1.40    75.27     2.97    47.53     73.02      0          71.33     18.89     57.20     0
Average                   73.42             47.10     72.50      27.41      72.38     24.41     54.13     0

[Figure 8.3. Average Reductions using DFC Scheme: MPG, peak power, average power and PDP reduction (%) for the different benchmark circuits]

These sets of resource constraints were chosen because they are a good representative of the types of resources at different operating voltages.
The number of allowable voltage levels is two (A9, Z Z 9) and the maximum number of allowable frequencies is three. The experimental results for the various benchmark circuits are reported in Table 8.4 for all three schemes for resource constraints RC2, RC3, and RC5. The power estimation step includes the power consumption of the overheads. In the case of MVDFC scheduling, the frequencies found are D*, SD* and 7RD*. For the MVMC and SVSF scheduling schemes, the operating frequency (r C ) is SD*. The table also reports the average reduction for the different benchmarks, averaged over all resource constraints. It is obvious from the table that the reductions using the MVDFC scheme are appreciable; for the MVMC scheme, on the other hand, there is no reduction in PDP. The average results over all five resource constraints are shown in Fig. 8.3 and 8.4.

[Figure 8.4. Average Reductions using Multicycling Scheme: MPG, peak power, average power and PDP reduction (%) for the different benchmark circuits]

In order to study the power consumption per cycle, we plotted the power profile for the different benchmarks over all the control steps (clock steps). Fig. 8.5, 8.6 and 8.7 show the power profiles of the benchmarks for resource constraints RC2, RC3, and RC5, respectively. The curves labeled SF correspond to the profile when the schedule is operated at a single frequency (which is the maximum frequency of the slowest operator, the multiplier) and a single voltage. The profiles labeled DFC correspond to the case when the dynamic clocking and multiple voltage scheme is used. Similarly, the profiles labeled MC are for the MVMC scheme. The effectiveness of the proposed scheduling schemes is obvious from the figures.
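The percentage columns of Table 8.4 follow the reduction formulas of Table 8.3. As a minimal sketch (the function names are hypothetical, and reading the power delay product as average power times critical path delay is an assumption), the mean power gradient entries of the first EXP row can be checked as follows.

```python
def percent_reduction(baseline, value):
    """Percentage reduction of `value` with respect to the SVSF `baseline`."""
    return (baseline - value) / baseline * 100.0


def power_delay_product(average_power, critical_path_delay):
    """Power delay product, read here as average power times critical path delay."""
    return average_power * critical_path_delay


# First EXP row of Table 8.4: mean power gradient for SVSF, MVDFC and MVMC.
mpg_svsf, mpg_mvdfc, mpg_mvmc = 8.42, 2.11, 5.96
red_mvdfc = round(percent_reduction(mpg_svsf, mpg_mvdfc), 2)  # 74.94, as tabulated
red_mvmc = round(percent_reduction(mpg_svsf, mpg_mvmc), 2)    # 29.22, as tabulated
```

Both values agree with the tabulated reductions, which confirms that the third and fifth numeric columns are reductions relative to the SVSF estimate in the first.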
[Figure 8.5. Power Profiles for Benchmarks (for RC2): cycle power profile versus control steps for (1) EXP, (2) FIR, (3) IIR, (4) HAL; curves SF, DFC, MC]

[Figure 8.6. Power Profiles for Benchmarks (for RC3): cycle power profile versus control steps for (1) EXP, (2) FIR, (3) IIR, (4) HAL; curves SF, DFC, MC]

[Figure 8.7. Power Profiles for Benchmarks (for RC5): cycle power profile versus control steps for (1) EXP, (2) FIR, (3) IIR, (4) HAL; curves SF, DFC, MC]

8.6 Conclusions

The reduction of cycle power fluctuation is important for a CMOS circuit. This chapter addresses power fluctuation reduction at the behavioral level using low power datapath scheduling techniques. Three datapath scheduling schemes have been introduced: (i) using a single supply voltage and single frequency (SVSF), (ii) using multiple supply voltages and dynamic clocking (MVDFC), and (iii) using multiple supply voltages and multicycling (MVMC). We used ILP-based optimizations for the three modes of datapath operation. The results of the MVDFC and MVMC schemes were compared with those of the SVSF scheme. With the dynamic frequency clocking scheme, significant reductions could be achieved in mean power gradient, peak power and average power, along with reductions in power delay product.
The results clearly indicate that dynamic frequency clocking is a better scheme than the multicycling approach for power minimization. The effectiveness of the scheduling schemes in the context of pipelined datapaths and control-intensive applications needs to be investigated.

CHAPTER 9
VLSI DESIGN FOR DIGITAL WATERMARKING OF IMAGES

Research in digital watermarking is well matured. Several watermarking algorithms have been proposed for image, video, audio and text in the current literature. Digital watermarking is the process of embedding data, called a watermark, into a multimedia object such that the watermark can be detected or extracted later to make an assertion about the object. There are a significant number of software implementations of the proposed algorithms, whereas hardware implementations are lacking. A hardware implementation has advantages over a software implementation in terms of low power, high performance, and reliability. In this chapter, we develop a hardware system that can insert invisible-robust, invisible-fragile and visible watermarks in an image. The hardware module can be easily incorporated in a JPEG encoder to develop a secure JPEG encoder. An outline of such a secure JPEG encoder is provided in Fig. 9.1 [176]. The secure JPEG codec can be a part of a scanner or a digital camera so that the digitized images are watermarked right at the origin. The proposed watermarking chip can also be directly integrated with any existing digital still camera. We provide the schematic view of a still camera having an inbuilt watermarking chip in Fig. 9.2, and call such a camera a secure digital still camera (SCDC).

This chapter is organized as follows. We first discuss the design and implementation of a spatial domain invisible-robust and invisible-fragile watermarking chip.
This is followed by the design and implementation of a chip that can insert one or two visible watermarks in an image in the spatial domain. Finally, a DCT domain visible and invisible-robust watermarking chip is discussed.

9.1 Invisible Watermarking in Spatial Domain

In this section, we propose a VLSI architecture [176] that can insert both invisible-robust and invisible-fragile watermarks in the spatial domain. Depending on the user's requirement, it can insert either of the watermarks or both. The following watermark insertion algorithms are implemented: (i) the invisible-robust algorithm from [177, 178] and (ii) the invisible-fragile algorithm proposed by the authors in [83, 72]. The two algorithms are quite different and were proposed recently.

[Figure 9.1. Secure JPEG Encoder : Block Level View [176]: (a) Spatial Domain Watermark, (b) DCT Domain Watermark]

[Figure 9.2. Secure Digital Still Camera : Schematic View]

9.1.1 Spatial Domain Invisible Watermarking Algorithms

In this section, we describe the algorithms (invisible-robust and invisible-fragile) chosen for VLSI implementation. We outline the insertion and detection methods in brief, with the modifications necessary to facilitate the hardware implementation. The notations needed for stating the algorithms are given in Table 9.1.

Table 9.1.
Notations used to Explain Spatial Domain Watermarking Algorithms
$I$ : Original image (gray image)
$W$ : Watermark image (binary or ternary image)
$(i, j)$ : A pixel location
$I_W$ : Watermarked image
$N_I \times N_I$ : Image dimension
$N_W \times N_W$ : Watermark dimension
$E, E_1, E_2$ : Watermark embedding functions
$D$ : Watermark detection function
$r$ : Neighborhood radius
$I_N$ : Neighborhood image (gray image)
$K$ : Digital (watermark) key
$\alpha_1, \alpha_2$ : Scaling constants (watermark strength)

9.1.1.1 Invisible Robust Algorithm

A block diagram of the watermark insertion scheme is shown in Fig. 9.3(a) [177, 178]. The watermark $W$ is a ternary image having pixel values {0, 1 or 2}. These values are generated using the digital key $K$. The watermark insertion is performed by altering the pixels of the original image as follows.

$$I_W(i,j) = \begin{cases} I(i,j) & \text{if } W(i,j) = 0 \\ E_1(I(i,j), I_N(i,j)) & \text{if } W(i,j) = 1 \\ E_2(I(i,j), I_N(i,j)) & \text{if } W(i,j) = 2 \end{cases} \qquad (9.1)$$

[Figure 9.3. Invisible Robust Watermarking in Spatial Domain [177, 178]: (a) Watermark Insertion, (b) Watermark Detection]

The encoding functions $E_1$ and $E_2$ are defined as follows, where $\alpha_1 > 0$ and $\alpha_2 > 0$.

$$E_1(I, I_N) = (1 - \alpha_1)\, I_N(i,j) + \alpha_1\, I(i,j)$$
$$E_2(I, I_N) = (1 - \alpha_1)\, I_N(i,j) - \alpha_2\, I(i,j) \qquad (9.2)$$

It may be noted that the above functions are slightly different from the original algorithm, where $\alpha_2$ is negative and the second encoding function involves addition instead of subtraction. However, these changes do not affect the overall encoding-decoding scheme, since we change the decoding functions accordingly.

The neighborhood image pixel gray value is calculated as the average gray value of the neighboring pixels of the original image for a particular neighborhood radius. For example, for neighborhood radius 1, it is calculated as:

$$I_N(i,j) = \frac{1}{2}\left[\frac{I(i,j+1) + I(i+1,j+1)}{2} + I(i+1,j)\right] \qquad (9.3)$$

The scaling $(1 - \alpha_1)$ is used to scale $I_N$ to ensure that the watermarked image gray value $I_W$ never exceeds the maximum gray value for an 8-bit image representation, which corresponds to a pure white pixel. The neighborhood radius $r$ determines the upper bound of the watermarked pixels in an image. It may be noted that a simple average could have been $\frac{I(i,j+1) + I(i+1,j+1) + I(i+1,j)}{3}$, but we used the above method of averaging to simplify the hardware implementation, since division by two can be implemented using a right shift by one bit.

The block diagram for watermark detection is provided in Fig. 9.3(b). The first step of the detection process is the generation of the watermark $W$ using the watermark key $K$. Next, the watermark is extracted from the test (watermarked) image using the detection function of Eqn. 9.4, which assigns each pixel of the extracted watermark $W^*(i,j)$ one of two values according to the sign of the difference between the test image pixel and the corresponding neighborhood image pixel $I_N(i,j)$. By comparing the original ternary watermark image $W$ and the extracted binary watermark image $W^*$, the ownership can be established when the detection ratio is larger than a predefined threshold, as explained in [177, 178].

9.1.1.2 Invisible Fragile Algorithm

The invisible fragile watermark insertion is carried out as follows (Fig. 9.4(a) [83, 72]). A pseudorandom binary sequence {0, 1} of period $N$ is generated using a linear shift register. The period $N$ is equal to the number of pixels ($N_I \times N_I$) of the image. The watermark is generated by arranging the binary sequence into blocks of size G" or RGHR. The size of the watermark is the same as the size of the image. The bit planes of the input image are derived and the watermark is inserted in the appropriate bit plane such that SNR > threshold.
Assuming that the watermark insertion is to be performed in the $k$-th bit plane, the watermark insertion process is given by the following expression, where $I^{[b]}$ denotes bit plane $b$ of the image.

$$I_W^{[b]}(i,j) = \begin{cases} I^{[b]}(i,j) \;\text{XOR}\; W(i,j) & \text{if } b = k \\ I^{[b]}(i,j) & \text{otherwise} \end{cases} \qquad (9.5)$$

[Figure 9.4. Invisible Fragile Watermarking in Spatial Domain [83, 72]: (a) Watermark Insertion, (b) Watermark Detection]

Finding the candidate bit plane for watermark insertion is an iterative process. We have chosen the F f .2" a0 bit plane as the candidate for watermark insertion (for LSB" ="). After merging all the bit planes, the watermarked image $I_W$ is obtained. For image authentication, the testing paradigm provided in [83, 72] is used. To construct the testing paradigm, the cross-correlations of the original image and the watermark image, and the cross-correlations of the watermarked image and the possibly forged test image, are calculated. Then, the test statistics are determined from the cross-correlations. The test statistics are the basis of the test paradigm.

9.1.2 VLSI Architecture for Invisible Spatial Domain Watermarking

In this section, we discuss the proposed architectures for the algorithms discussed in the previous section.

9.1.2.1 Architecture for Robust Watermarking

[Figure 9.5. Datapath for Robust Watermarking]

The datapath for invisible robust watermarking is shown in Fig. 9.5. The image RAM is used to store the original image, which is to be watermarked.
The image data can be written to the image RAM by activating the proper control signals. The watermark RAM serves as storage space for the watermark data. The watermark data can either be generated using the shift register or given as an external input by the user. In this hardware design, it is assumed that at any point of time, a 256 × 256 image can be stored in the image RAM and a 128 × 128 watermark can be stored in the watermark RAM. It is possible to watermark only a 128 × 128 region of the original image at a time, whereas the full image can be watermarked if the process is repeated for the other regions (four times in total for the assumed sizes). The region of the original image to be watermarked is described in terms of five parameters, namely top left, top right, center, bottom left, and bottom right, and address decoders are used to determine the proper locations.

[Figure 9.6. Datapath for Fragile Watermarking]

The invisible robust watermark insertion scheme involves adding (or subtracting) a constant times the image pixel gray value to (from) a constant times the neighborhood function. The constants are $\alpha_1$ and $\alpha_2$, the values of which determine the strength of the watermark. The four output lines from the image RAM provide the pixels $I(i,j)$, $I(i,j+1)$, $I(i+1,j)$ and $I(i+1,j+1)$ for the row-column address pair $(i,j)$. The neighborhood function specified by Eqn. 9.3 is computed as follows. First, $I(i,j+1)$ and $I(i+1,j+1)$ are given to adder 1 as inputs. The resulting sum and carry out from adder 1 are fed to adder 2 along with $I(i+1,j)$. The resulting sum of adder 2 is the neighborhood function value. The division by two is performed by shifting the result right by one bit, consequently discarding the rightmost bit (LSB).
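The shift-based neighborhood computation and the robust insertion rule can be mirrored in software. The sketch below is not the chip's implementation. It assumes that the three neighbors are I(i, j+1), I(i+1, j+1) and I(i+1, j), that each halving is a one-bit right shift, that the encoding functions are E1 = (1 - α1)·I_N + α1·I and E2 = (1 - α1)·I_N - α2·I as described for the datapath, and that the last row and column are left untouched. The function names are hypothetical.

```python
def neighborhood(img, i, j):
    """Neighborhood value for radius 1, using right shifts for the divisions by two."""
    first = (img[i][j + 1] + img[i + 1][j + 1]) >> 1   # models adder 1 plus shift
    return (first + img[i + 1][j]) >> 1                # models adder 2 plus shift


def embed_robust(img, wm, alpha1, alpha2):
    """Return a watermarked copy of img for a ternary watermark wm (values 0, 1, 2)."""
    out = [row[:] for row in img]
    for i in range(len(img) - 1):          # border row and column kept as-is (assumption)
        for j in range(len(img[0]) - 1):
            w = wm[i][j]
            if w == 0:
                continue                    # pixel is left unaltered
            n = neighborhood(img, i, j)
            if w == 1:
                value = (1 - alpha1) * n + alpha1 * img[i][j]
            else:
                value = (1 - alpha1) * n - alpha2 * img[i][j]
            out[i][j] = max(0, min(255, int(value)))   # keep only the integer part
    return out


# Hypothetical 3 x 3 image and watermark.
image = [[100, 120, 90], [80, 60, 40], [20, 30, 50]]
watermark = [[1, 2, 0], [0, 0, 0], [0, 0, 0]]
marked = embed_robust(image, watermark, alpha1=0.25, alpha2=0.25)
```

The truncation to an integer plays the role of discarding the fractional bits of the multiplier outputs; the clamp to the 8-bit range is an extra safeguard of this sketch rather than something stated for the hardware.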
The scaling of the neighborhood function is achieved by multiplying it by $(1 - \alpha_1)$ using multiplier 2. At the same time, the scaling of the image pixel gray values is performed in multiplier 1 by multiplying $I(i,j)$ by $\alpha_2$ or $\alpha_1$. The eight higher-order bits of the multiplier outputs are fed to the adder/subtractor unit to perform watermark insertion as per Eqn. 9.2. Since we are concerned only with the integer values of the pixels, the lower eight bits of the multiplier results, which represent the values after the decimal point, are discarded. The output of the adder/subtractor unit (watermarked image pixels) and the original image pixel values are multiplexed based on the watermark values, and the watermarked values are written into the image RAM if the watermark value is 1 or 2, as per Eqn. 9.1.

[Figure 9.7. Datapath For Combined Spatial Domain Invisible Robust / Fragile Watermarking]

9.1.2.2 Architecture for Fragile Watermarking

The datapath for fragile watermark insertion is shown in Fig. 9.6. The original image is stored in the image RAM, and the watermark is created in the same way as in the case of robust watermarking described above and is stored in the watermark RAM. For watermark insertion, the candidate bit line of the image pixels is fed as input to an XOR gate along with that of the watermark value. The output of the XOR gate is returned to the image RAM and the candidate bit line is overwritten by selecting the appropriate control signals.

9.1.2.3 Overall Chip Architecture

The combined datapath for both robust and fragile watermarking is shown in Fig. 9.7. The datapath is obtained by stitching the two datapaths (Fig. 9.5 and Fig.
9.6) using multiplexers, which in turn give rise to additional control signals. The controller that drives the datapath is shown in Fig. 9.8. The controller has five states: S0, S1, S2, S3 and S4. The state S0 is the initial state. In state S1, the image and watermark data are written into the respective RAMs. The image and the watermark pixels are read from the RAMs in state S2 and watermark insertion is performed. In state S3, the watermarked pixels are written back to the image RAM. In state S4, the watermarked image is ready in the RAM. The control signals and their functional descriptions are given in Table 9.2.

[Figure 9.8. Controller For Combined Spatial Domain Invisible Robust / Fragile Watermarking: states S0 (initial state), S1 (read image and read/create watermark), S2 (perform watermarking), S3 (write watermarked pixels), S4 (display the watermarked image); transitions on START, WM_COMPLETED and IM_COMPLETED]

9.1.3 Implementation of Spatial Domain Invisible Watermarking Chip

In this section, we discuss the implementation of the integrated architecture, which combines the two architectures from the previous section. The implementation of the watermarking datapath and controller was carried out in the physical domain using the Cadence Virtuoso layout tool, following a bottom-to-top hierarchical design approach. The design involved the construction of three main modules: the memory, the watermarking module (datapath) and the controller unit. Each of the three modules was designed individually through modularization and later interfaced with the others. The layouts of the gates at the lowest level of the hierarchy are drawn using the CMOS standard cell design approach. We designed a standard cell library containing basic gates, such as AND, OR and NOT, and a 1-bit RAM cell.

Table 9.2. Control Signals for Spatial Domain Invisible Watermarking Chip
IM_ADDR_COUNT : increment signal for the counters used to generate address for image
WM_ADDR_COUNT : increment signal for the counters used to generate address for watermark
IM_READ/WRITE : image RAM read (1) or write (0)
WM_READ/WRITE : watermark RAM read or write
IM_DATA_SELECT : select input or watermarked image
WM_DATA_SELECT : select input or generate watermark
IM_ADDR_SELECT : select location of image
WM_ADDR_SELECT : select address of watermark
START : watermarking begins when set to 1
IM_COMPLETED : set to 1 when all the pixels of the image are covered
WM_COMPLETED : set to 1 when all the pixels in watermark are covered
BUSY : high as long as the watermarking process continues
DATA_READY : high when watermarked image is ready to be read
ROBUST/FRAGILE : choose between robust or fragile

The memory module involves two read/write memory structures, one for the 256 × 256 original/watermarked image and the other for the 128 × 128 watermark. The bit size for the image RAM is R#d bits and for the watermark RAM it is #d bits. The basic building block of a memory module is a #d transistor static RAM cell available in the cell library. We have chosen SRAM instead of DRAM because of its shorter read and write cycles. The memories are built as arrays of SRAM cells and are addressed using row and column address decoders. Each decoder is implemented as a counter with additional AND logic to address the cells.

The watermarking module (datapath) involves the implementation of the two watermarking algorithms described in Section 9.1.1. The main components of this module are two 8-bit adders, two 8-bit multipliers and an 8-bit adder/subtractor. Each adder is constructed using 1-bit adders in a ripple-carry manner. The adder/subtractor unit is obtained from the adder using XOR gates.
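A ripple-carry adder turned into an adder/subtractor with XOR gates can be modeled bit by bit. The sketch below is a generic software model of that standard construction, not the layout used in the chip; the function names are hypothetical. Driving the XOR inputs and the carry input high adds the two's complement of the second operand, which performs subtraction.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder returning (sum, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def add_sub_8bit(a, b, subtract):
    """8-bit ripple-carry adder/subtractor.

    a, b     : operands in the range 0..255
    subtract : 0 for a + b, 1 for a - b (two's complement)
    Returns (8-bit result, final carry_out).
    """
    carry = subtract                            # carry input is high for subtraction
    result = 0
    for bit in range(8):
        a_bit = (a >> bit) & 1
        b_bit = ((b >> bit) & 1) ^ subtract     # XOR gate inverts b when subtracting
        s, carry = full_adder(a_bit, b_bit, carry)
        result |= s << bit
    return result, carry


sum_result = add_sub_8bit(100, 27, 0)    # plain addition
diff_result = add_sub_8bit(100, 27, 1)   # subtraction via inverted b and carry-in
```

With this construction a single control line selects the operation, which is why one signal derived from the watermark pixel value is enough to switch between the two encoding functions.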
The carry input to the adder/subtractor and one of the inputs to each XOR gate are set high whenever the watermark pixel value is 2, so that a subtraction is carried out as required by the robust watermarking encoding function (Eqn. 9.2). An 8-bit parallel array multiplier is built using full adders and AND gates to implement the multiplication operations with reduced delay. Several multiplexers are used at appropriate places in the design to select one of the incoming lines. Each such multiplexer is implemented using a combination of transmission gates. Three asynchronously resettable registers are designed to encode the five states of the controller depicted in Fig. 9.8. At any time, the three registers can be reset by the user to return the controller to its initial state, and from there the watermarking function can be started afresh.

[Figure 9.9. Layout of the Invisible Spatial Domain Watermarking Datapath and Controller: (a) Datapath Layout, (b) Controller Layout]

Table 9.3. Power Area Details for Individual Units
Modules      Gate Count   Power     Delay
Datapath     4547         1.1931    0.9158
Controller   233          0.0045    0.3901
RAM          1183,744     21.8012   2.3891

Each of the above-mentioned modules is implemented and tested separately, and the modules are then connected together to obtain the final chip. The number of gates, power and areas of each module are shown in Table 9.3 for an operating voltage of Z Z 9. The statistics are obtained using HSPICE for "# Z ' MOSIS SCN3M SCMOS technology. It is evident from the above statistics that the RAM consumes the largest amount of power. If we assume that the proposed chip is to be used as a module within a complete JPEG encoder, then the memory module could be avoided in the watermarking datapath circuit. The layout of the datapath is shown in Fig. 9.9(a), and the layout of the controller is shown in Fig. 9.9(b). The layout of the RAM is shown in Fig. 9.10.

[Figure 9.10. Layout of RAM (zoomed view of a portion is shown)]
This shows a zoomed view of a small portion of the RAM. The complete layout and the floor plan of the watermarking chip are given in Fig. 9.11. The pin diagram of the chip, showing the inputs and the outputs, is given in Fig. 9.12. The overall design statistics of the chip are in Table 9.4.

Table 9.4. Overall Chip Statistics
Area (with RAM) : 7A"7GDDe C
Number of gates (with RAM) : D7RDR
Number of gates (without RAM) : RD"
Operating Voltage : Z Z 9
Clock frequency (with RAM) : 7#7*
Clock frequency (without RAM) : D*
Number of I/O pins : D
Power (with RAM) : D
Power (without RAM) : A"

[Figure 9.11. Layout of the Proposed Spatial Domain Invisible Watermarking Chip]

[Figure 9.12. Pin Diagram for the Proposed Spatial Domain Invisible Watermarking Chip: pins IM_DATA_IN, WM_DATA_IN, WM_DATA_SELECT, ROBUST/FRAGILE, START, RESET, CLOCK, DATA_OUT, BUSY, DATA_READY]

9.1.4 Results and Conclusions

The verification of the chip implementation was performed by watermarking several test images, examples of which are shown in Fig. 9.13 and Fig. 9.14. Visual inspection of the images illustrates the quality of the watermarking. As a quantitative measure of the perceptibility of the watermark, we used the expression for signal-to-noise ratio given in Eqn. 9.6, as suggested by [159, 83, 72].

$$SNR = 10 \log_{10} \left( \frac{Var_I}{Var_E} \right) \qquad (9.6)$$

[Figure 9.13. Spatial Domain Invisible Watermarked Shuttle: (a) Original Shuttle, (b) Robust Watermarked, (c) Fragile Watermarked]

[Figure 9.14. Spatial Domain Invisible Watermarked Bird: (a) Original Bird, (b) Robust Watermarked, (c) Fragile Watermarked]

Here $Var_I$ is the variance of the original input image and $Var_E$ is the variance of the error image (the difference between the original input image and the watermarked image). We calculated the SNR using the original and the watermarked images with the help of a software simulator. The SNR for the various watermarked images was in the range of Z 6U;=dD6E;.
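The SNR measure can be sketched in a few lines, together with a toy fragile insertion to produce a watermarked image. The sketch assumes that Eqn. 9.6 is ten times the base-10 logarithm of the ratio of the original image variance to the error image variance, and that fragile insertion XORs a binary watermark into one bit plane. The function names, the 2 × 2 image and the chosen bit plane are hypothetical.

```python
import math


def variance(values):
    """Population variance of a flat list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def snr_db(original, watermarked):
    """SNR in dB from the variances of the original and of the error image.

    Raises ZeroDivisionError when the two images differ by a constant,
    because the error variance is then zero.
    """
    flat_original = [p for row in original for p in row]
    flat_marked = [p for row in watermarked for p in row]
    error = [o - w for o, w in zip(flat_original, flat_marked)]
    return 10.0 * math.log10(variance(flat_original) / variance(error))


def insert_fragile(img, wm_bits, k):
    """XOR a binary watermark into bit plane k of every pixel (k = 0 is the LSB)."""
    return [[p ^ (b << k) for p, b in zip(img_row, wm_row)]
            for img_row, wm_row in zip(img, wm_bits)]


# Hypothetical 2 x 2 example, inserting into bit plane 1.
image = [[10, 200], [55, 128]]
bits = [[1, 0], [0, 1]]
marked = insert_fragile(image, bits, k=1)
snr = snr_db(image, marked)
```

Inserting into a higher bit plane makes the error image larger and therefore lowers the SNR, which is the trade-off behind the iterative search for a candidate bit plane that keeps the SNR above a threshold.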
In this work, we presented a watermarking encoder that can perform invisible-robust watermarking, invisible-fragile watermarking and the combination of both in the spatial domain. To our knowledge, this is the first watermarking architecture having both functionalities. The chip can be easily integrated in any existing JPEG encoder to watermark the images right at the source end. The disadvantage of the watermarking algorithms implemented is that the processing needs to be done pixel by pixel. In the future, we aim to investigate block-by-block processing. A low power, high performance watermarking decoder, which will be a part of a JPEG decoder, is currently under implementation.

9.2 Visible Watermarking in Spatial Domain

In this section, we present a new VLSI architecture for two visible watermarking schemes presented in the literature. We implement the VLSI architecture using "# Z ' CMOS technology. The proposed watermarking chip is designed with the aim of easy integration with any existing digital camera framework [179]. To our knowledge, this is the first watermarking chip implementing visible watermarking schemes.

9.2.1 Watermarking Algorithms

In this section, we discuss the image watermarking algorithms whose VLSI architecture is proposed. We outline the schemes in brief, with the modifications necessary to facilitate the hardware implementations. The notations needed for the description of the algorithms are listed in Table 9.5.

9.2.1.1 Visible Watermarking Algorithm 1 :

In this subsection, we discuss the visible watermarking algorithm proposed in [73]. The watermark has three goals: (i) the visible watermark should identify the ownership, (ii) the visual quality of the host image should be preserved, and (iii) the watermark should be difficult to remove from the host image. To satisfy these three conflicting criteria, schemes have been proposed for adding the watermark to the original image.
The watermarked image is obtained by adding a scaled gray value of the watermark image to the host image. The amount of scaling is chosen in such a way that the alteration of each original image pixel occurs to a perceptually equal degree. The original formulas have been simplified to the following [75].

Table 9.5. List of Variables used in Algorithm Explanation
  $I$ : Original (or host) image (a grayscale image)
  $W$ : Watermark image (a grayscale image)
  $(m,n)$ : A pixel location
  $I_W$ : Watermarked image
  $N_I \times N_I$ : Original image dimension
  $N_W \times N_W$ : Watermark image dimension
  $i_k$ : The $k^{th}$ block of the original image $I$
  $w_k$ : The $k^{th}$ block of the watermark image $W$
  $i_{W_k}$ : The $k^{th}$ block of the watermarked image $I_W$
  $\alpha_k$ : Scaling factor for the $k^{th}$ block (used for host image scaling)
  $\beta_k$ : Embedding factor for the $k^{th}$ block (used for watermark image scaling)
  $\mu_I$ : Mean gray value of the original image $I$
  $\mu_{I_k}$ : Mean gray value of the original image block $i_k$
  $\sigma_{I_k}$ : Variance of the original image block $i_k$
  $\alpha_{max}$ : The maximum value of $\alpha_k$
  $\alpha_{min}$ : The minimum value of $\alpha_k$
  $\beta_{max}$ : The maximum value of $\beta_k$
  $\beta_{min}$ : The minimum value of $\beta_k$
  $I_{white}$ : Gray value corresponding to a pure white pixel
  $\alpha_I$ : A global scaling factor
  $c_1, c_2, c_3, c_4$ : Linear regression coefficients

$$ I_W(m,n) = \begin{cases} I(m,n) + W(m,n)\left(\dfrac{I_{white}}{38.667}\right)\left(\dfrac{I(m,n)}{I_{white}}\right)^{2/3}\alpha_I & \text{for } \dfrac{I(m,n)}{I_{white}} > 0.008856 \\[2ex] I(m,n) + W(m,n)\left(\dfrac{I(m,n)}{903.3}\right)\alpha_I & \text{for } \dfrac{I(m,n)}{I_{white}} \le 0.008856 \end{cases} \tag{9.7} $$

The scaling factor $\alpha_I$ determines the strength of the watermark. Our aim is to implement the watermarking algorithms in hardware. The above equation is simplified so that the hardware implementation becomes easier. At the same time, care is taken to make sure that the hardware is as accurate as the software implementations. We assume $I_{white} = 255$ and simplify the above equations to the following.

$$ I_W(m,n) = \begin{cases} I(m,n) + \dfrac{\alpha_I}{6.0976}\, W(m,n)\,\big(I(m,n)\big)^{2/3} & \text{for } I(m,n) > 2.2583 \\[2ex] I(m,n) + \dfrac{\alpha_I}{903.3}\, W(m,n)\, I(m,n) & \text{for } I(m,n) \le 2.2583 \end{cases} \tag{9.8} $$

The above expression involves a cubic root calculation, which could complicate the hardware implementation.
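As a rough software reference for the per-pixel insertion function, the sketch below evaluates the two cases of Eqn. 9.7 directly. It assumes the constants 38.667, 903.3 and 0.008856 and a pure white value of 255, as read from the equations above; the names `I_WHITE`, `THRESHOLD` and `insert_pixel` are hypothetical, and no clipping of the result to the displayable range is attempted.

```python
I_WHITE = 255.0                   # assumed gray value of a pure white pixel
THRESHOLD = 0.008856 * I_WHITE    # about 2.2583, the comparison value of Eqn. 9.8


def insert_pixel(i, w, alpha_i):
    """Watermarked gray value for host pixel i, watermark pixel w, strength alpha_i."""
    if i > THRESHOLD:
        # Bright-enough pixels: scaling follows a (2/3)-power law of the host value.
        scale = (I_WHITE / 38.667) * (i / I_WHITE) ** (2.0 / 3.0)
    else:
        # Very dark pixels: scaling is linear in the host value.
        scale = i / 903.3
    return i + alpha_i * w * scale


# The constant in front of I^(2/3) after substituting I_WHITE = 255:
POWER_LAW_DIVISOR = 1.0 / ((I_WHITE / 38.667) * (1.0 / I_WHITE) ** (2.0 / 3.0))
```

The fractional power in the first branch is the operation that is expensive in hardware, which motivates replacing it with a cheaper approximation.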
So, we further simplify the above expressions and replace the cubic root function with a piecewise linear model. We divide the gray value range $[0, I_{white}]$ into four ranges, namely $[0, \frac{I_{white}}{4}]$, $[\frac{I_{white}}{4}, \frac{I_{white}}{2}]$, $[\frac{I_{white}}{2}, \frac{3 I_{white}}{4}]$, and $[\frac{3 I_{white}}{4}, I_{white}]$. We fit four linear regression coefficients that best approximate the cubic root in each of these ranges. Moreover, we round up the fraction involved in the comparison operation, and the final simplified expression that is implemented in hardware is as follows.

$$ I_W(m,n) = \begin{cases} I(m,n) + \dfrac{\alpha_I}{903.3}\, W(m,n)\, I(m,n) & \text{for } I(m,n) \le 3 \\[1.5ex] I(m,n) + \alpha_I \dfrac{c_1}{6.0976}\, W(m,n)\, I(m,n) & \text{for } 3 < I(m,n) \le 64 \\[1.5ex] I(m,n) + \alpha_I \dfrac{c_2}{6.0976}\, W(m,n)\, I(m,n) & \text{for } 64 < I(m,n) \le 128 \\[1.5ex] I(m,n) + \alpha_I \dfrac{c_3}{6.0976}\, W(m,n)\, I(m,n) & \text{for } 128 < I(m,n) \le 192 \\[1.5ex] I(m,n) + \alpha_I \dfrac{c_4}{6.0976}\, W(m,n)\, I(m,n) & \text{for } 192 < I(m,n) \le 255 \end{cases} \tag{9.9} $$

9.2.1.2 Visible Watermarking Algorithm 2:

In this subsection, we discuss the visible watermarking algorithm proposed in [83]. The pixel gray values are modified based on local and global statistics. The watermark insertion process consists of the following steps.

• Both the host image $I$ (the one to be watermarked) and the watermark image $W$ are divided into blocks of equal sizes (the two images may be of unequal size).
• Let $i_k$ denote the $k^{th}$ block of the original image $I$ and $w_k$ denote the $k^{th}$ block of the watermark $W$. For each block $i_k$, the local statistics, namely the mean $\mu_{I_k}$ and the variance $\sigma_{I_k}$, are computed. The image mean gray value $\mu_I$ is also found.
• The watermarked image block $i_{W_k}$ is obtained by modifying $i_k$ as follows, where $\alpha_k$ and $\beta_k$ are the scaling and embedding factors, respectively, depending on $\mu_{I_k}$ and $\sigma_{I_k}$ of each host image block.

$$ i_{W_k} = \alpha_k\, i_k + \beta_k\, w_k, \qquad k = 1, 2, \ldots \tag{9.10} $$

The choice of $\alpha_k$ and $\beta_k$ is governed by certain characteristics of the human visual system (HVS), and mathematical models are proposed so that the perceptual quality of the image is not degraded due to watermark addition. The $\alpha_k$ and $\beta_k$ are obtained as follows.

• The $\alpha_k$ and $\beta_k$ for edge blocks are taken to be $\alpha_{max}$ and $\beta_{min}$, respectively.
• For the remaining blocks, $\alpha_k$ and $\beta_k$ are found using the following equations.

$$ \alpha_k = \frac{1}{\hat{\sigma}_{I_k}} \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right), \qquad \beta_k = \hat{\sigma}_{I_k}\left(1 - \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right)\right) \tag{9.11} $$

Where, $\hat{\mu}_{I_k}$ and $\hat{\mu}_I$ are the normalised values of $\mu_{I_k}$ and $\mu_I$, and $\hat{\sigma}_{I_k}$ are the normalised logarithm values of $\sigma_{I_k}$.

• The $\alpha_k$ and $\beta_k$ are scaled to the ranges $(\alpha_{min}, \alpha_{max})$ and $(\beta_{min}, \beta_{max})$, respectively, where $\alpha_{min}$ and $\alpha_{max}$ are the minimum and maximum values of the scaling factor, and $\beta_{min}$ and $\beta_{max}$ are the minimum and maximum values of the embedding factor. These parameters determine the extent of watermark insertion.

A linear transformation is used to scale the current $\alpha$ and $\beta$ values to the ranges $(\alpha_{min}, \alpha_{max})$ and $(\beta_{min}, \beta_{max})$, respectively. Let the current values of $\alpha$ be written as $\alpha'_k$, and let $\alpha'_{min}$ and $\alpha'_{max}$ denote the current minimum and maximum values, respectively. Similarly, let the current values of $\beta$ be written as $\beta'_k$, and let $\beta'_{min}$ and $\beta'_{max}$ denote the current minimum and maximum values, respectively. The $\alpha$ and $\beta$ values are scaled as follows.

$$ \alpha_k = \alpha_{min} + \left(\alpha_{max} - \alpha_{min}\right)\frac{\alpha'_k - \alpha'_{min}}{\alpha'_{max} - \alpha'_{min}}, \qquad \beta_k = \beta_{min} + \left(\beta_{max} - \beta_{min}\right)\frac{\beta'_k - \beta'_{min}}{\beta'_{max} - \beta'_{min}} \tag{9.12} $$

We used first-order derivatives for edge detection. For horizontal edge detection, we compute the horizontal gradient as:

$$ G_h(m,n) = I(m,n) - I(m+1,n) \tag{9.13} $$

The vertical gradient is computed as follows for vertical edge detection.

$$ G_v(m,n) = I(m,n) - I(m,n+1) \tag{9.14} $$

The amplitude of an edge is calculated as,

$$ G(m,n) = \left|G_h(m,n)\right| + \left|G_v(m,n)\right| \tag{9.15} $$

The mean amplitude for a block of $N_B \times N_B$ pixels is computed as,

$$ \mu_G = \frac{1}{N_B \times N_B} \sum_{m}\sum_{n} G(m,n) \tag{9.16} $$

When the mean amplitude for a block exceeds a predefined threshold, we declare it an edge block. The values of $m$ and $n$ correspond to the pixel locations of individual blocks with reference to the original image pixel location.

The mean gray value of a block is calculated as the average of the gray values of all pixels in the image block. The mean gray values are normalized with the pure white pixel gray value. Thus, we have the normalized mean gray value of a block as,

$$ \hat{\mu}_{I_k} = \frac{1}{N_B \times N_B}\left(\frac{1}{I_{white}}\right) \sum_{m}\sum_{n} I(m,n) \tag{9.17} $$

Where, $m$ and $n$ are the pixel locations of the $k^{th}$ image block, the same as their locations in the original image. The normalized standard deviation of gray values for the $k^{th}$ block is calculated as follows.
$$ \hat{\sigma}_{I_k} = \frac{2}{N_B \times N_B}\left(\frac{1}{I_{white}}\right) \sum_{m}\sum_{n} \left| I(m,n) - \frac{I_{white}}{2} \right| \tag{9.18} $$

Here, $N_B \times N_B$ denotes the block size. The exponential term in Eqn. 9.11 is approximated as a power series. We have the following Taylor series approximation, which was used up to the square term in our implementation.

$$ e^{-x} = 1 - x + \frac{x^2}{2!} - \cdots \tag{9.19} $$

In step three of the insertion algorithm, scaling needs to be done using a linear transformation. The transformation needs the current minimum and maximum values of both $\alpha$ and $\beta$ over all the blocks. Because of this, the hardware performance would be severely degraded, since the hardware has to wait until all the pixels of the image are covered in order to find the local statistics of all the blocks. So, we modify Eqn. 9.11 to ensure that the performance of the hardware is improved with no compromise on the quality. We find $\alpha_k$ and $\beta_k$ using the following equations.

$$ \alpha_k = \alpha_{min} + \left(\alpha_{max} - \alpha_{min}\right)\frac{1}{\hat{\sigma}_{I_k}} \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right), \qquad \beta_k = \beta_{min} + \left(\beta_{max} - \beta_{min}\right)\hat{\sigma}_{I_k}\left(1 - \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right)\right) \tag{9.20} $$

Extensive simulations for various images show that the $\alpha_k$ and $\beta_k$ obtained using Eqn. 9.12 and Eqn. 9.20 are comparable (maximum difference is [...]) [72]. Thus, we use Eqn. 9.20 for the $\alpha_k$ and $\beta_k$ calculations.

9.2.2 VLSI Architecture

In this section, we discuss the architectures proposed for the hardware implementations of the algorithms described in Section 9.2.1. We discuss the architecture of the first algorithm and the architecture of the second algorithm in the first and the second subsection, respectively. The two architectures are stitched together to develop the proposed watermarking datapath. The FSM based design of a controller that drives the datapath is also outlined. We assume that both the original host image and the watermark image are stored in some memory in the digital camera framework and are available for processing. The images may be in some compression format or may be available as raw ASCII data.
If the image is in a compressed format, a corresponding decoder is needed to decode the image and obtain the uncompressed data. The decoder implementation is not a part of this research.

9.2.2.1 Architecture for Algorithm 1:

The insertion operation for the first watermarking algorithm is described in Eqn. 9.7. This insertion function is simplified to Eqn. 9.9 using a piecewise linear model, so that we have a compact and efficient hardware design, as described in the previous section. Fig. 9.15(a) shows the architecture proposed for the first algorithm.

Figure 9.15. Datapath Architectures for the Visible Watermarking Algorithms: (a) For Algorithm 1, (b) For Algorithm 2

The watermarking in this scheme is performed pixel by pixel, as is evident from the insertion function. A register file is used to store the constants needed to scale the image-watermark product in Eqn. 9.9. We store the constants $\frac{1}{903.3}$, $\frac{c_1}{6.0976}$, $\frac{c_2}{6.0976}$, $\frac{c_3}{6.0976}$, and $\frac{c_4}{6.0976}$. The other constant, $\alpha_I$, is treated as a parameter that can be changed by the user to vary the watermark strength. The comparator is used to determine the range in which a particular pixel gray value lies, so that the appropriate constant can be picked from the register file. The left side multiplier calculates the appropriate constant times the host image pixel gray value, and the right side multiplier is used to find $\alpha_I$ times the watermark image pixel gray value.
The results of the above two multipliers are fed to the third multiplier, which effectively calculates the product of the selected constant, $\alpha_I$, the host image pixel gray value, and the watermark image pixel gray value. This product is added to the host image pixel gray value using the adder to obtain the watermarked image pixel gray value. The process described above has to be carried out for all the pixels in order to obtain the watermarked image.

9.2.2.2 Architecture for Algorithm 2:

The proposed architecture for the second algorithm is shown in Fig. 9.15(b). Using the second algorithm, the watermark insertion is performed block by block, as described in Eqn. 9.10. However, for each block the watermark insertion has to be carried out pixel by pixel. The proposed architecture in Fig. 9.15(b) presents the operation at the pixel level. The $\alpha_k$ and $\beta_k$ calculation unit computes the $\alpha_k$ and $\beta_k$ values for the $k^{th}$ non-edge block using the expression in Eqn. 9.20. The edge detection unit determines whether a block is an edge block or a non-edge block: if $\mu_G$ exceeds a user defined threshold, then it is an edge block. The larger the threshold, the fewer the blocks that are declared as edge blocks. The multiplexers help in selecting the scaling and embedding factors between the edge and non-edge blocks. The left side multiplier calculates the scaling factor times the host image pixel gray value. The right side multiplier multiplies the embedding factor with the watermark image pixel gray value. The products from these two multipliers are added using an adder to find the watermarked image pixel gray value. This process is repeated for all pixels in a block, and subsequently for all the blocks in the image.

$\alpha_k$ and $\beta_k$ calculation unit: The architectural details of the $\alpha_k$ and $\beta_k$ calculation unit are shown in Fig. 9.16(a). This hardware implements Eqn. 9.20 for the $\alpha_k$ and $\beta_k$ calculation for one block at a time.
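The pixel-level datapath of Fig. 9.15(b) described above can be modelled in software as follows. This is a minimal sketch under stated assumptions: blocks are lists of rows of gray values, the factors for the block have already been computed, and the function name `watermark_block` and its parameter names are hypothetical. It only mirrors the multiplexer selection and the two-multiplier, one-adder structure of Eqn. 9.10.

```python
def watermark_block(host, mark, alpha_k, beta_k, is_edge, alpha_max, beta_min):
    """Return the watermarked block alpha * host + beta * mark (Eqn. 9.10)."""
    # Multiplexers: edge blocks use the fixed factors, non-edge blocks the computed ones.
    alpha = alpha_max if is_edge else alpha_k
    beta = beta_min if is_edge else beta_k
    watermarked = []
    for host_row, mark_row in zip(host, mark):
        # Left multiplier: alpha * host pixel; right multiplier: beta * watermark pixel;
        # the adder combines the two products.
        watermarked.append([alpha * i + beta * w for i, w in zip(host_row, mark_row)])
    return watermarked
```

For an edge block, the host is scaled by the largest scaling factor and the watermark by the smallest embedding factor, so edges of the host image are disturbed the least.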
The left side adder accumulator combination finds the sum of all the image pixel gray values for a block. After the sum is multiplied with [...]

Figure 9.16. Individual Datapath Units for Algorithm 2: (a) Architecture of the $\alpha_k$ and $\beta_k$ Calculation Unit, (b) Architecture of the Edge Detection Unit

The adder/subtractor unit finds the absolute deviation of the image pixel gray value from $\frac{I_{white}}{2}$. The adder accumulator following it accumulates $\sum_m \sum_n \left| I(m,n) - \frac{I_{white}}{2} \right|$ for a block. When this sum is multiplied with [...]

Edge detection unit: The architecture used to declare whether a block is an edge or a non-edge block is shown in Fig. 9.16(b). The left side and the right side calculate the absolute value of the horizontal gradient, $|G_h(m,n)|$, and the absolute value of the vertical gradient, $|G_v(m,n)|$, respectively. The amplitude of an edge, $G(m,n)$, is calculated using the first adder. Then, the adder accumulator combination finds the sum of $G(m,n)$ for all pixels of a block.
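The edge decision of Eqns. 9.13 to 9.16 can be illustrated with a small software model. The sketch assumes the image is a list of rows of gray values and that a missing neighbour at the image border contributes a zero gradient; that border rule, like the function name `is_edge_block`, is an assumption made only for this example.

```python
def is_edge_block(image, top, left, size, threshold):
    """Return (edge_flag, mean_amplitude) for the size x size block at (top, left)."""
    rows, cols = len(image), len(image[0])
    total = 0
    for m in range(top, top + size):
        for n in range(left, left + size):
            pixel = image[m][n]
            # Neighbours are addressed in whole-image coordinates (assumed border rule:
            # reuse the pixel itself when the neighbour falls outside the image).
            below = image[m + 1][n] if m + 1 < rows else pixel
            right = image[m][n + 1] if n + 1 < cols else pixel
            g_h = pixel - below       # Eqn. 9.13
            g_v = pixel - right       # Eqn. 9.14
            total += abs(g_h) + abs(g_v)   # Eqn. 9.15, accumulated over the block
    mean_amplitude = total / (size * size)  # Eqn. 9.16
    return mean_amplitude > threshold, mean_amplitude
```

Raising the threshold makes the test stricter, so fewer blocks are classified as edge blocks.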
The accumulated sum, when multiplied with [...]

Figure 9.17. Architecture for the Proposed Watermarking Processor: (a) Merged Datapath for Algorithms 1 and 2, (b) Controller for the Merged Datapath (states: Init, Read Pixel, Read Block, Write Block, Write Pixel, Display Image; signals: Start, Select, BlockCompleted, ImageCompleted)

[...] whole image, the ImageCompleted signal is set to 1, thus completing the watermarking process. DisplayImage is the state at which the watermarked image is ready in the digital camera storage.

9.2.3 Chip Implementation

The implementation of the watermarking datapath and controller was carried out in the physical domain using the Cadence Virtuoso layout tool, following a bottom-up hierarchical design approach. The design involved the construction of the main units, namely the exponential unit, the edge detection unit, the $\alpha_k$ and $\beta_k$ calculation unit, the register file, and the accumulator. All of the above units contain multipliers, adders, adder/subtractors, dividers, comparators and so on. These small functional units are laid out individually through modularization and later interfaced with each other to obtain the main units mentioned above. The datapath and the controller are constructed using the main units and the functional units. The layouts of the gates at the lowest level of the hierarchy are drawn using the CMOS standard cell design approach. We designed our own standard cell library containing basic gates, such as AND, OR, and NOT. The datapath construction involves the implementation of the architecture proposed in the previous section.
The fundamental functional units are 8-bit adders, 8-bit multipliers and 8-bit adder/subtractors. Each adder is constructed from 1-bit adders in a ripple-carry manner. The adder/subtractor unit is obtained from the adder using XOR gates [180]. The carry input to the adder/subtractor and one of the inputs to each XOR gate are set high whenever the select signal for this unit is 1, so that a subtraction is carried out. The output of the adder/subtractor module gives the absolute value of the difference of two numbers when the difference is positive. When the difference is less than 0 (which is indicated by the carry bit taking the value 0), the absolute value is obtained by taking the 2's complement of the output of the adder/subtractor module. An 8-bit parallel array multiplier is built from full adders and AND gates to implement multiplication operations with reduced delay [181].

The divider is implemented using shift-and-subtract logic [180]. The number to be divided is initially stored in two registers, A and Q. In each step, the values in A and Q are shifted left, with the most significant bit of Q replacing the least significant bit of A. If the divisor can be subtracted from A, the subtraction is performed and a 1 is placed in the least significant bit of Q. If the value in A is less than that of the divisor, the same shift procedure is followed, except that a 0 is placed in the least significant bit of Q. Finally, the quotient is available in the register Q, and the remainder in A.

The comparator was designed to compare the values of two 8-bit numbers for the greater-than, equal-to, or less-than relations. First, a single-bit comparator was designed to compare the values of two single-bit numbers, and later instances of this module were cascaded to compare two 8-bit numbers, starting from the most significant bit position and proceeding towards the least significant bit position. The accumulator is implemented as a 14-bit register to accommodate a maximum value of $64 \times 255$.
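The behaviour of the arithmetic units described above can be checked against a short bit-level model. The sketch below is only an illustration under assumptions: operands are unsigned 8-bit integers, the divider is modelled as textbook restoring (shift-and-subtract) division, and the names `abs_difference`, `shift_subtract_divide` and `ACCUMULATOR_BITS` are hypothetical rather than taken from the chip design.

```python
def abs_difference(x, y, bits=8):
    """|x - y| as computed by an adder/subtractor followed by a 2's complement stage."""
    mask = (1 << bits) - 1
    # Subtraction as x + (bitwise NOT y) + 1, i.e. XOR-inverted operand with carry-in 1.
    total = x + ((~y) & mask) + 1
    carry = total >> bits
    diff = total & mask
    if carry == 0:
        # Carry 0 signals a negative difference: take the 2's complement of the output.
        diff = (((~diff) & mask) + 1) & mask
    return diff


def shift_subtract_divide(dividend, divisor, bits=8):
    """Restoring division with registers A (remainder) and Q (quotient)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << bits) - 1
    a, q = 0, dividend & mask
    for _ in range(bits):
        # Shift A:Q left by one; the MSB of Q moves into the LSB of A.
        a = (a << 1) | ((q >> (bits - 1)) & 1)
        q = (q << 1) & mask
        if a >= divisor:
            a -= divisor   # subtraction succeeds: quotient bit 1
            q |= 1
        # otherwise the quotient bit stays 0
    return q, a


# Width needed to accumulate 64 pixels of an 8 x 8 block, each at most 255.
ACCUMULATOR_BITS = (64 * 255).bit_length()
```

The last line reproduces the sizing argument for the accumulator: the largest possible block sum fits in 14 bits.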
The maximum value occurs when each pixel in an $8 \times 8$ block assumes the gray value of a pure white pixel. The register file is an addressable array of 8-bit registers (words) [181]. Based on the address specified and a Read/Write select line, at any time a value can be either written to or read from the register file. Here, we used a 5-word register file to store the five different constants of Eqn. 9.9, namely $\frac{1}{903.3}$, $\frac{c_1}{6.0976}$, $\frac{c_2}{6.0976}$, $\frac{c_3}{6.0976}$, and $\frac{c_4}{6.0976}$. Multiplexers are used at appropriate places in the design to select one of the incoming lines. Each multiplexer is implemented using a combination of transmission gates. Three asynchronously resettable registers are designed to encode the five states of the controller depicted in Fig. 9.17(b). The three registers can be reset by the user to return the controller to its initial state at any time, and from there the watermarking function can be started afresh.

Figure 9.18. Layout of Datapath and Controller of the Proposed Chip: (a) Datapath, (b) Controller

Each of the above mentioned modules is implemented and tested separately, and the modules are then connected together to obtain the final chip. The number of gates, power and delay of each module are shown in Table 9.6 for an operating voltage of 3.3 V. The statistics are obtained using HSPICE for the 0.35 µm MOSIS SCN3M SCMOS technology. It is assumed that the proposed chip is to be used as a module in any existing JPEG encoder or digital camera, and uses their memory. The layout of the watermarking datapath is shown in Fig. 9.18(a). The layout of the controller is shown in Fig. 9.18(b).

Figure 9.19. Layout and Floor Plan of the Proposed Watermarking Chip: (a) Chip Layout, (b) Chip Floor Plan ($\alpha_k$ and $\beta_k$ Calculation Unit, Edge Detection Unit, Controller, Other Components)

Table 9.6. Power and Area of Different Units
  Modules                                    Gate Count   Power (mW)   Delay
  Exponential unit                           2370         1.2314       0.8981
  Edge detection unit                        3599         1.4137       1.0967
  $\alpha_k$ and $\beta_k$ calculation unit  16279        3.444        2.0241
  Controller                                 163          0.0034       0.3201

The complete layout of the watermarking chip is given in Fig. 9.19(a) and the floor plan of the chip is provided in Fig. 9.19(b). The clock frequency is determined by the critical delay of the watermarking module. Table 9.7 shows the overall design details of the chip, and the corresponding pin diagram is shown in Fig. 9.20.

Table 9.7. Overall Statistics of the Watermarking Chip
  Area                  [...]
  Number of gates       [...]
  Supply voltage        3.3 V
  Clock frequency       [...]
  Number of I/O pins    [...]
  Power                 [...]

Figure 9.20. Pin Diagram for the Proposed Watermarking Chip (pins: Second/First, $\alpha_{min}$, $\alpha_{max}$, $\beta_{min}$, $\beta_{max}$, $\alpha_I$, ImageDataIn, WatermarkDataIn, Start, Reset, Clock, DataOut, Busy, DataReady)

9.2.4 Results and Conclusions

Each of the functional units is simulated individually before being integrated to develop the whole chip. The functional verification of the whole chip is done by performing watermarking on various test images. Fig. 9.21 shows various test images and the watermark image used, which are borrowed from [83, 74, 77, 72]. The test images as well as the watermark images are of dimension [...]. The watermarked images obtained using the first algorithm are shown in Fig. 9.22. For this algorithm, the values of the minimum and maximum scaling and embedding factors are assumed as [...]. Similarly, Fig. 9.23 shows the watermarked images obtained using the second algorithm, assuming [...]. Using simulations, the regression coefficients $c_1$, $c_2$, $c_3$, and $c_4$ are found to be [...], respectively.

Figure 9.21. Original Host Images (a, b, and c) and Watermark Image (d): (a) Lena, (b) Bird, (c) Nuts and Bolts, (d) Watermark

A visual inspection of the watermarked images shows that the watermarking process is able to preserve the quality of the image while explicitly proving the ownership. Of the various quantitative measures available to quantify the quality of the watermarked images, we used the signal-to-noise ratio (SNR) given in Eqn. 9.6. Software simulation results show that the SNR for various watermarked images is in the range of [...] dB to [...] dB.

In this work, we have presented a watermarking chip that can be integrated within a digital camera framework for watermarking images. The watermarking chip can also be integrated in any existing JPEG encoder. The chip has two different types of watermarking capabilities, both in the spatial domain. To our knowledge, this is the first watermarking chip having visible watermarking functionalities. Of the two watermarking schemes implemented, the first one does pixel-by-pixel processing and the second one is a block-by-block processing algorithm. Additional work needs to be done to develop block-by-block operation for the first algorithm so that high performance hardware can be designed. However, both algorithms are comparable from the SNR point of view.

Figure 9.22. Watermarked Images for the First Algorithm: (a) Lena, (b) Bird, (c) Nuts and Bolts

Figure 9.23. Watermarked Images for the Second Algorithm: (a) Lena, (b) Bird, (c) Nuts and Bolts

9.3 Invisible and Visible Watermarking in DCT Domain

It is well known that a watermark can prove copyright and provide authenticity of a multimedia object. The watermarking can be performed on the multimedia object in the spatial, DCT or wavelet domain. In the previous sections, we described VLSI implementations of visible and invisible watermarking algorithms. In this era of portable electronic appliances, power consumption is a major issue.
Thus, any VLSI chip will be commercially viable only if its power consumption is minimal. VLSI chips operating at multiple supply voltages are widely proposed as a solution for low power optimization. Recently, dynamic (or variable) frequency and multiple frequency operation have been proposed as techniques for low power design. In this work, we propose DCT domain low power watermarking architectures using both multiple supply voltages and multiple clock frequencies. The detailed architecture and the prototype chip implementation using TSMC 0.25 µm technology are given in [85]. The prototype chip runs at frequencies of [...] and [...] and voltages of [...] and [...].

9.3.1 Watermarking Algorithms

The spread spectrum invisible watermarking algorithm from [182, 183, 80] and the DCT domain visible watermarking algorithm from [74, 77, 72] are chosen for VLSI implementation. We use the notations listed in Table 9.8 in our description.

9.3.1.1 Spread Spectrum Invisible Watermarking Insertion Algorithm

In [182, 183, 80], the watermark is inserted into the spectral components of the image using a technique analogous to spread spectrum communication. The watermark is inserted judiciously in the perceptually significant components of a signal to make it robust to common signal distortions, geometric distortion, and malicious attacks, while maintaining the perceptual quality of the image. The insertion of the watermark in the host image is as follows. The DCT coefficients are computed assuming the entire original image as one block. The 1000 largest of these coefficients are identified as the perceptually significant ones for the image. The watermark $X = \{x_1, x_2, \ldots, x_{1000}\}$ is computed, where each $x_i$ is chosen according to $N(0,1)$, where $N(0,1)$ denotes a normal distribution with mean 0 and variance 1.
The watermark is inserted in the DCT domain of the image by setting the frequency components in the original image using the following.

$$ c_{I_W}(m,n) = c_I(m,n)\left(1 + \alpha\, x_i\right) \tag{9.21} $$

The values of $m$ and $n$ correspond to the pixel locations of the 1000 largest DCT coefficients, and $\alpha = $ [...].

Table 9.8. Notations used in the Description of the Algorithm
  $I$ : Original (or host) image (a grayscale image)
  $c_I$ : DCT transformed original image
  $W$ : Watermark image (a grayscale image)
  $c_W$ : DCT transformed watermark image
  $(m,n)$ : A pixel location
  $I_W$ : Watermarked image
  $c_{I_W}$ : DCT transformed watermarked image
  $N_I \times N_I$ : Original image dimension (same as watermarked image dimension)
  $N_W \times N_W$ : Watermark image dimension
  $N_B \times N_B$ : Dimension of a block
  $N_{IB}$ : Number of original image blocks
  [...]

9.3.1.2 Visible Watermarking Insertion Algorithm

The DCT domain visible watermarking algorithm proposed in [74, 77, 72] incorporates human visual system (HVS) models to insert the watermark adaptively. The insertion algorithm is as follows. The original image $I$ (the one to be watermarked) and the watermark image $W$ are divided into blocks of size $N_B \times N_B$. The DCT coefficients $c_I$ for all the blocks of the original image are found. For each block of the original image, the mean gray value is computed as $\mu_{I_k} = c_{I_k}(0,0)$. The normalized mean gray value is calculated using the following equation.

$$ [\ldots] \tag{9.22} $$

Then the normalized mean gray value of the whole image is calculated as follows. [...]

The $\alpha_k$ and $\beta_k$ are scaled to the ranges $(\alpha_{min}, \alpha_{max})$ and $(\beta_{min}, \beta_{max})$, respectively. The edge blocks are determined, and the $\alpha_k$ and $\beta_k$ for edge blocks are taken to be $\alpha_{max}$ and $\beta_{min}$, respectively. The DCT coefficients $c_W$ for all the blocks of the watermark image are found. The visible watermark is inserted in the host image block by block, and the watermarked image block is obtained.
The number of blocks watermarked is $N_{WB}$; thus, for $k = 0, 1, \ldots, N_{WB} - 1$,

$$ c_{I_W k} = \alpha_k\, c_{I k} + \beta_k\, c_{W k} \tag{9.27} $$

9.3.1.3 Algorithm Modification for Hardware Implementations

For the invisible watermarking insertion in Eqn. 9.21, the three largest AC DCT coefficients of each block are considered as the candidates.

$$ c_{I_W k}(m,n) = c_{I k}(m,n) + \alpha\, r_k(m,n), \qquad k = 0, 1, \ldots, N_{IB} - 1 \tag{9.28} $$

Where, $(m,n)$ corresponds to the three largest AC DCT values of the $k^{th}$ block. The random number matrix $r_k$ is constructed from the random numbers $X = \{x_1, x_2, \ldots\}$.

For the visible watermarking algorithm, edge detection is an important step. The first step of edge detection involves the summation of the absolute values of all AC DCT coefficients of each block, as follows. [...] the DC DCT coefficient of a block having all white pixels. Thus, Eqn. 9.22 is modified to the following:

$$ [\ldots] \tag{9.30} $$

We now aim at reducing the performance degradation due to the normalization involved in Eqn. 9.25. Using Eqn. 9.25 in Eqn. 9.26, we have the following equation.

$$ [\ldots] \tag{9.31} $$

The factor [...] in Eqn. 9.31 serves as a constant scaling factor. Hence, we redefine the equations as follows.

$$ \alpha'_k = \hat{\sigma}_{I_k} \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right), \qquad \beta'_k = \frac{1}{\hat{\sigma}_{I_k}}\left(1 - \exp\left(-\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2\right)\right) \tag{9.32} $$

Where, the $\alpha'_k$ and $\beta'_k$ values are the current values of $\alpha_k$ and $\beta_k$, respectively. The above equations contain an exponential, $\exp(\cdot)$, which needs to be addressed. Eqn. 9.32 can be rewritten using the Taylor series approximation up to the square term as follows.

$$ \alpha'_k = \hat{\sigma}_{I_k}\left[1 - \left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2 + \frac{1}{2}\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^4\right], \qquad \beta'_k = \frac{1}{\hat{\sigma}_{I_k}}\left[\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^2 - \frac{1}{2}\left(\hat{\mu}_{I_k} - \hat{\mu}_I\right)^4\right] \tag{9.33} $$

Now the $\alpha'_k$ and $\beta'_k$ are scaled to the ranges $(\alpha_{min}, \alpha_{max})$ and $(\beta_{min}, \beta_{max})$, respectively. The scaled $\alpha'_k$ and $\beta'_k$ are, respectively, the $\alpha_k$ and $\beta_k$ we are looking for.

9.3.2 VLSI Architecture

The overall architecture of the proposed DCT domain watermarking chip, which can insert both invisible and visible watermarks, is shown in Fig. 9.24.
This is a decentralized controller architecture in which each module has its own controller. Here, we describe the proposed architecture in brief. The detailed architecture and the corresponding VLSI implementation are given in [85].

Figure 9.24. Combined Architecture for DCT Domain Invisible and Visible Watermarking Chip (modules: DCT Modules, Random Number Generation Module, Invisible Insertion Module, Edge Detection Module, Perceptual Analyzer Module, Scaling and Embedding Factor Module, Visible Insertion Module; inputs: Original Image, Watermark Image, $\alpha$, $\alpha_{min}$, $\alpha_{max}$, $\beta_{min}$, $\beta_{max}$; output: Watermarked Image)

The modules used for invisible watermark insertion are the DCT module, the random number generator, and the invisible insertion module (shown in Fig. 9.25). After the DCT coefficients of the host image are calculated by the DCT module, the insertion module adds the random numbers to them. The $\alpha$ parameter is also given as an input to the insertion module. The three appropriate AC DCT coefficients are chosen for watermark insertion using a counter. The DCT module is shown in Fig. 9.25(a). The DCT module consists of the following three submodules: (i) DCT_X, (ii) DCT_Y, and (iii) Controller. Apart from these, flip-flops and latches are used to store and forward the appropriate AC DCT coefficients to the insertion module. The architecture of both the DCT_X and DCT_Y modules is borrowed from [184, 185]. Both DCT_X and DCT_Y use sixteen multipliers and twelve adders. All multipliers and adders pertain to the IEEE 754 standard as implemented in the IEEE.std_logic_arith package in VHDL [186]. The DCT controller determines the coefficients to be forwarded, the memory addresses where the coefficients are to be stored, and the time to trigger the invisible insertion module and the random number generation module. The invisible watermark insertion module is shown in Fig. 9.25(b).
The insertion module, which consists of a multiplier and an adder, has its own controller. The insertion module scales the generated random number with $\alpha$ and adds it to the DCT coefficient. The random number generation module consists of linear feedback shift registers (LFSR) [180].

Figure 9.25. Architecture of the Different Units used for Invisible Watermarking: (a) DCT Module, (b) Invisible Insertion Module

The five modules involved in visible watermarking are as follows: (i) DCT module, (ii) Edge Detection module, (iii) Perceptual Analyzer module, (iv) Scaling and Embedding Factor module, and (v) Visible Watermark Insertion module. Each of these modules is discussed below. The architecture of the DCT module is the same as the one discussed above (Fig. 9.25(a)). The architectures of the rest are shown in Fig. 9.26. The edge detection module determines the edge blocks in the original image. The threshold constant $\tau$ is given as an input to the edge detection module. The three parts of the edge detection module each implement a particular function, namely the accumulation, comparison and detection needed for edge detection (refer to Eqn. 9.29). The perceptual analyzer module evaluates Eqns. 9.22 and 9.25. Similar to the edge detection module, the perceptual analyzer module is also divided into three submodules. The first submodule, namely the mean calculator, computes the mean of the AC DCT coefficients.
The result of this submodule is passed onto the ne xt submodule called the v ariance calculator module, which 252 PAGE 270 Accumulator Comparator  m AC I k   m AC I k  Max ( ) Edge Detector ACDCT Coefficients 13 17 17 t 17 Block Edge or Nonedge 17 (a) Edge Detection Module AC mean m I k AC s AC I k m I k DC m DC I Variance 13 ACDCT Coefficients 26 13 13 Coefficients 13 DCDCT DC mean 13 13 (b) Perceptual Analyser Module Scaling Module Scaling Module DC m I DC m I k s AC I k a k b k 13 13 24 24 AlphaBeta Module 13 26 13 (c) Scaling and Embedding F actor Module b min c I W k c W k c I k a k b k a max 13 13 13 13 13 13 26 Visible Insertion Module sel (d) V isible Insertion Module Figure 9.26. Architecture of the Dif ferent Units used for V isible W atermarking calculates the v ariance in the A CDCT coef cients. The DCDCT mean calculator is the third submodule of the perceptual analyzer These submodules are implemented with adders, and feedback ipops, etc.. The scaling f actor and the embedding f actor are computed by the Scaling and Embedding f actor module using Eqn. 9.33. This module is di vided into tw o sub modules. The rst module calculates the scaling f actors and the embedding f actors and is called the alphabeta module. The second sub module scales do wn the scaling and embedding f actors to a particular range depending on the user dened ranges.2ilTmFu:hglok<0and.blTmF:B6lkh0. The last module in this chip is the w atermark insertion module. It serv es the purpose of inserting the w atermark into the original image. Using the information pro vided by the edge detection module and the scaling and embedding f actor module, the w atermark is inserted into the original image. It consists of tw o multipliers and an adder for e v aluating the Eqn. 
9.27 and has an architecture similar to that of the invisible insertion module (Fig. 9.25(b)). Multiplexers are used to select appropriate values of α and β for non-edge blocks and edge blocks.

The chip is to be operated with dual-frequency, dual-voltage supplies (refer to Fig. 9.27). Apart from the dual clock supplies, local clocks are automatically generated to trigger the operation of some modules. These local clocks are generated by the localized controllers embedded within each module. This type of clock generation within the chip helps to indirectly implement the clock gating technique. A low voltage supply is used for the DCT modules. The chip is implemented in such a way that the clock frequency of the non-DCT modules must be an exact multiple of the clock frequency of the DCT module. The DCT block processes 4 image pixels at a time, whereas the other modules in the chip operate on one pixel at a time. Hence the DCT block can be clocked at one fourth of the non-DCT clock frequency. The delay of the DCT module is less than its clock period. In this way a slack is introduced in the DCT module, which makes it possible to operate the DCT module at a lower voltage. The combination of low clock frequency and low supply voltage translates to lower power consumption by the DCT module.

[Figure 9.27. Dual Voltage and Dual Frequency Operation of the Datapath]

A hierarchical design approach was adopted in implementing the chip. Standard cell design methodology was used for generating the layout. The standard cell design library was obtained from [187] and is designed using TSMC [feature size illegible] CMOS technology. The standard cell library includes basic gates, flip-flops, IO pads and corner cells.
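The power argument made for the DCT module (one fourth of the clock frequency combined with a lower supply voltage) can be illustrated with the first-order dynamic power relation P ∝ C·V²·f. The following is a minimal sketch under that assumption only: it ignores short-circuit power, leakage and level-converter overhead, and the function names and the voltage values in the usage note are hypothetical placeholders, not the chip's actual figures.

```python
def relative_dynamic_power(v_scaled: float, v_nominal: float,
                           freq_ratio: float) -> float:
    """Dynamic power of a module relative to its nominal operation.

    Assumes the first-order model P = a * C * V^2 * f with unchanged
    switched capacitance and switching activity (an assumption of this
    sketch, not a result taken from the chip).
    """
    if v_scaled <= 0 or v_nominal <= 0 or freq_ratio <= 0:
        raise ValueError("voltages and frequency ratio must be positive")
    # Quadratic dependence on supply voltage, linear dependence on frequency.
    return (v_scaled / v_nominal) ** 2 * freq_ratio


def pixels_per_second(pixels_per_cycle: int, clock_hz: float) -> float:
    """Throughput of a module that handles a fixed number of pixels per cycle."""
    return pixels_per_cycle * clock_hz
```

With a hypothetical nominal supply of 2.5 V, a hypothetical lower supply of 1.5 V and `freq_ratio=0.25`, `relative_dynamic_power` evaluates to about 0.09, i.e. roughly 9% of the nominal dynamic power under this model. A block that handles 4 pixels per cycle at one fourth of the clock frequency has the same `pixels_per_second` as a block that handles 1 pixel per cycle at the full frequency, which is why the slower DCT clock does not reduce throughput.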
The layout for each module was generated and later integrated to obtain the final chip. The detailed implementation of the DCT domain watermarking chip is discussed in [85]. The layout of the overall chip, the floorplan of the chip and the chip statistics are given in Fig. 9.28, Fig. 9.29 and Table 9.9, respectively.

[Figure 9.28. Layout of the DCT Domain Invisible and Visible Watermarking Chip [85]]

[Figure 9.29. Floorplan of the DCT Domain Invisible and Visible Watermarking Chip [85]]

Table 9.9. Overall Statistics of the DCT Domain Watermarking Chip [85]
Area: [value illegible]
Supply Voltages: [values illegible]
Operating Frequencies: [values illegible]
Power (Dual Voltage and Frequency): [value illegible]
Power (Normal Operation): [value illegible]

CHAPTER 10
CONCLUSIONS AND FUTURE WORK

The reduction of peak power, peak power differential, average power and energy are equally important. In this dissertation, we propose a framework for the reduction of these parameters through datapath scheduling at the behavioral level. Several ILP-based and heuristic-based scheduling schemes are developed for datapath synthesis to minimize energy; energy delay product; peak power; simultaneous peak power and average power; simultaneous peak power, average power, peak power differential and energy; and power fluctuation. Three modes of circuit design are considered: single supply voltage and single frequency (SVSF), multiple supply voltages and dynamic frequency clocking (MVDFC), and multiple supply voltages and multicycling (MVMC).
A new parameter called the Cycle Power Function (CPF) is defined, which captures the transient power characteristics as the equally weighted sum of the normalized mean cycle power and the normalized mean cycle differential power. The ILP-based schemes provide optimal solutions; however, the problem complexity grows exponentially with the number of operations in the data flow graph. The alternative is the heuristic-based approach. The heuristic-based algorithms provide polynomial time bound solutions for the scheduling problem. The reduction in energy and energy delay product was approximately the same for both the heuristic and the ILP-based methods. Similarly, the peak power (and average power) reduction achieved by the peak and average power minimization schemes was appreciable. The most significant results were obtained by the cycle power function minimization schemes, which reduced both transient power and energy. Similarly, a comparison of the multicycling-based schemes with the dynamic frequency clocking based schemes reveals that the latter perform better in almost all instances.

None of the datapath scheduling algorithms available in the current literature minimize transient power. There are few works available that handle peak power minimization. There are no research works handling both voltage and frequency parameters. Thus, we conclude that any of the low power datapath scheduling algorithms proposed in this dissertation can have a strong impact on low power behavioral synthesis research.

The dissertation also involved the design of visible and invisible watermarking chips in both the spatial and DCT domains. The chips can be easily integrated with any existing JPEG encoder or still digital camera.
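The Cycle Power Function described in this chapter (an equally weighted sum of normalized mean cycle power and normalized mean cycle differential power) can be sketched in a few lines. The sketch below rests on two assumptions that are not stated in this chapter: each mean is normalized by its own peak value, and the cycle differential power is taken as the absolute deviation of a cycle's power from the mean cycle power. The function name and the input format (one power value per control step) are hypothetical.

```python
from typing import Sequence


def cycle_power_function(cycle_power: Sequence[float]) -> float:
    """Equally weighted sum of normalized mean cycle power and
    normalized mean cycle differential power.

    Assumptions of this sketch: normalization is by the respective peak,
    and the differential of a cycle is |mean power - cycle power|.
    """
    n = len(cycle_power)
    if n == 0:
        raise ValueError("at least one control step is required")

    mean_power = sum(cycle_power) / n
    peak_power = max(cycle_power)

    # Per-cycle deviation from the mean, used as the differential power.
    diffs = [abs(mean_power - p) for p in cycle_power]
    mean_diff = sum(diffs) / n
    peak_diff = max(diffs)

    # Guard against division by zero for all-zero or perfectly flat profiles.
    norm_mean = mean_power / peak_power if peak_power > 0 else 0.0
    norm_diff = mean_diff / peak_diff if peak_diff > 0 else 0.0
    return norm_mean + norm_diff
```

Under these assumptions, two schedules with the same total energy can be told apart: the flat profile `[6.0, 6.0]` scores lower than the fluctuating profile `[2.0, 10.0]`, which matches the intent of penalizing both high peaks and cycle-to-cycle fluctuation.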
While the combined robust-fragile spatial domain invisible watermarking chip consumes [value illegible] power, the spatial domain visible watermarking chip consumes [value illegible]. The watermarked images produced by the watermarking chips are comparable with those obtained using the corresponding software implementations. The DCT domain watermarking chip is capable of inserting a spread spectrum invisible watermark and an adaptive visible watermark. It operates in dual supply voltage and dual frequency mode. All the watermarking chips designed are the first implementations in their respective categories. In this digital age, when copyright infringement and piracy are threats to industrial growth, secure digital devices integrated with watermarking chips can produce copyrighted multimedia data in real time.

The scheduling algorithms need to be extended to include pipelined datapaths in both dynamic frequency clocking and multicycling scenarios. The benchmarks used to test the scheduling schemes are data-intensive digital signal processing benchmark circuits. The effectiveness of the scheduling algorithms for control-intensive applications needs to be investigated. Integer Linear Programming (ILP) based techniques for datapath scheduling are optimal, but cannot handle large benchmark circuits. Heuristic algorithms are fast but generate suboptimal solutions, and should be used for scheduling large benchmarks. The power model may be modified to consider the effect of exact switching activity and binding. More research is needed to develop low power dynamic clocking units for the generation of dynamic frequencies in VLSI circuits. The effect of dynamic clocking on the overall clock network has to be studied. Similarly, the design work can be extended to develop pipelined and/or SIMD-based designs.

REFERENCES

[1] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power CMOS Digital Design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp.
473-483, Apr. 1992.
[2] Y. L. Lin, "Recent Developments in High-Level Synthesis," ACM Transactions on Design Automation of Electronic Systems, vol. 2, no. 1, pp. 2-21, Jan. 1997.
[3] M. C. McFarland, A. C. Parker, and R. Camposano, "The High-Level Synthesis of Digital Systems," Proceedings of the IEEE, vol. 78, no. 2, pp. 301-318, Feb. 1990.
[4] D. Gajski and N. Dutt, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[5] D. Singh, J. M. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal, N. Sehgal, and T. J. Mozdzen, "Power Conscious CAD Tools and Methodologies: A Perspective," Proceedings of the IEEE, vol. 83, no. 4, pp. 570-594, Apr. 1995.
[6] D. Sylvester and H. Kaul, "Power Driven Challenges in Nanometer Design," IEEE Design and Test of Computers, vol. 13, no. 6, pp. 12-21, Nov.-Dec. 2001.
[7] D. Sylvester and H. Kaul, "Future Performance Challenges in Nanometer Design," in Proceedings of the 38th Design Automation Conference, June 2001, pp. 3-8.
[8] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, "Reducing Power in High-Performance Microprocessors," in Proceedings of the ACM/IEEE Design Automation Conference, 1998, pp. 732-737.
[9] L. Benini, G. De Micheli, and A. Macii, "Designing Low-Power Circuits: Practical Recipes," IEEE Circuits and Systems Magazine, vol. 1, no. 1, pp. 6-25, March 2001.
[10] S. Borkar, "Design challenges of technology scaling," IEEE Micro, vol. 19, no. 4, pp. 23-29, July-Aug. 1999.
[11] V. De and S. Borkar, "Technology and design challenges for low power and high performance [microprocessors]," in Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 163-168.
[12] D. E. Lackey, P. S. Zuchowski, and J. Koehl, "Designing mega-ASICs in nanogate technologies," in Proceedings of the Design Automation Conference, 2003, pp. 770-775.
[13] E. Sicard and S. D. Bendhia, Deep-submicron CMOS Circuit Design (Simulator in Hands), Brooks/Coles, 2003.
[14] J. S. Lis and D. D. Gajski, "Synthesis from VHDL," in Proceedings of the International Conference on Computer Design, 1988, pp. 378-381.
[15] R. Camposano and W. Wolf, High-Level VLSI Synthesis, Kluwer Academic Publishers, 1991.
[16] A. Raghunathan, N. K. Jha, and S. Dey, High-Level Power Analysis and Optimization, Kluwer Academic Publishers, 1998.
[17] M. Pedram, "Power Minimization in IC Design: Principles and Applications," ACM Transactions on Design Automation of Electronic Systems, vol. 1, no. 1, pp. 3-56, Jan. 1996.
[18] L. Benini and G. De Micheli, "System-Level Power Optimization: Techniques and Tools," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 2, pp. 115-192, Apr. 2000.
[19] J. M. Chang and M. Pedram, Power Optimization and Synthesis at Behavioral and System Levels using Formal Methods, Kluwer Academic Publishers, 1999.
[20] K. Roy and S. C. Prasad, Low Power CMOS VLSI Circuits, John Wiley and Sons, 2000.
[21] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, Inc., 1994.
[22] C. Park, Task Scheduling in High Level Synthesis, Ph.D. thesis, University of Illinois at Urbana-Champaign, 1996.
[23] A. C. Parker, J. Pizarro, and M. Mlinar, "MAHA: A Program for Datapath Synthesis," in Proceedings of the 23rd ACM/IEEE Design Automation Conference, June 1986, pp. 461-466.
[24] P. G. Paulin and J. P. Knight, "Force Directed Scheduling for the Behavioral Synthesis of ASICs," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 8, no. 6, pp. 661-679, June 1989.
[25] S. Devadas and A. R. Newton, "Algorithms for Allocation in Datapath Synthesis," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, pp. 768-781, July 1989.
[26] P. G. Paulin and J. P. Knight, "Scheduling and Binding Algorithms for High-Level Synthesis," in Proceedings of 26th ACM/IEEE Design Automation Conference, June 1989, pp.
1-6.
[27] C. A. Papachristou and H. Konuk, "A Linear Program Driven Scheduling and Allocation Method," in Proceedings of the 27th ACM/IEEE Design Automation Conference, 1990, pp. 77-83.
[28] I. C. Park and C. M. Kyung, "Fast and Near Optimal Scheduling in Automatic Data Path Synthesis," in Proceedings of the 28th Design Automation Conference, 1991, pp. 680-685.
[29] R. Jain, A. Majumdar, A. Sharma, and H. Wang, "Empirical Evaluation of Some High-Level Synthesis Scheduling Heuristics," in Proceedings of the 28th Design Automation Conference, 1991, pp. 210-215.
[30] C. T. Hwang, J. H. Lee, and Y. C. Hsu, "A Formal Approach to the Scheduling Problem in High Level Synthesis," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 10, no. 4, pp. 85-93, April 1991.
[31] R. A. Walker and S. Chaudhuri, "Introduction to the Scheduling Problems," IEEE Design and Test of Computers, vol. 12, no. 2, pp. 60-69, Summer 1995.
[32] S. Raje and M. Sarrafzadeh, "GEM: A Geometric Algorithm for Scheduling," in Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 3), 1993, pp. 1991-1994.
[33] J. Zhu and D. D. Gajski, "Soft Scheduling in High Level Synthesis," in Proceedings of the 36th Design Automation Conference, 1994, pp. 219-224.
[34] M. J. M. Heijligers, L. J. M. Cluitmans, and J. A. G. Jess, "High-level Synthesis Scheduling and Allocation using Genetic Algorithms," in Proceedings of the 28th Design Automation Conference, 1991, pp. 61-66.
[35] S. Haynal and F. Brewer, "Automata-Based Symbolic Scheduling for Looping DFGs," IEEE Transactions on Computers, vol. 50, no. 3, pp. 250-267, Mar. 2001.
[36] R. Camposano, "Path-Based Scheduling for Synthesis," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 10, no. 1, pp. 85-93, Jan. 1991.
[37] P. G. Paulin and J. P. Knight, "Algorithms for High-Level Synthesis," IEEE Design and Test of Computers, vol. 6, no. 6, pp.
18-31, Dec. 1999.
[38] E. Musoll and J. Cortadella, "Scheduling and Resource Binding for Low Power," in Proceedings of the 8th International Symposium on System Synthesis, 1995, pp. 104-109.
[39] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on the Design of Buffer Circuit," IEEE Journal of Solid-State Circuits, vol. 19, no. 4, pp. 468-473, Aug. 1984.
[40] A. C. Williams, A. D. Brown, and M. Zwolinski, "Simultaneous Optimization of Dynamic Power, Area and Delay in Behavioral Synthesis," IEE Proceedings on Computer and Digital Techniques, vol. 147, no. 6, pp. 383-390, Nov. 2000.
[41] R. S. Martin and J. P. Knight, "Optimizing Power in ASIC Behavioral Synthesis," IEEE Design and Test of Computers, vol. 13, no. 2, pp. 58-70, Summer 1996.
[42] A. P. Chandrakasan and R. W. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits," Proceedings of the IEEE, vol. 83, no. 4, pp. 498-523, April 1996.
[43] H. S. Yun and J. Kim, "Power Aware Modulo Scheduling for High-Performance VLIW Processors," in Proceedings of the International Symposium on Low Power Electronics and Design, 2001, pp. 40-45.
[44] R. S. Martin and J. P. Knight, "Using Spice and Behavioral Synthesis Tools to Optimize ASICs' Peak Power Consumption," in Proceedings of the 38th Midwest Symposium on Circuits and Systems, 1996, pp. 1209-1212.
[45] S. P. Mohanty, N. Ranganathan, and S. K. Chappidi, "Peak Power Minimization Through Datapath Scheduling," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Feb. 2003, pp. 121-126.
[46] S. P. Mohanty, N. Ranganathan, and S. K. Chappidi, "Simultaneous Peak and Average Power Minimization During Datapath Scheduling for DSP Processors," in Proceedings of the ACM Great Lakes Symposium on VLSI, Apr. 2003, pp. 215-220.
[47] V. Raghunathan, S. Ravi, A. Raghunathan, and G.
Lakshminarayana, "Transient Power Management through High Level Synthesis," in Proceedings of the International Conference on Computer Aided Design, 2001, pp. 545-552.
[48] S. P. Mohanty and N. Ranganathan, "A Framework for Energy and Transient Power Reduction During Behavioral Synthesis," in Proceedings of the International Conference on VLSI Design, Jan. 2003, pp. 539-545.
[49] L. Benini, G. Castelli, A. Macii, and R. Scarsi, "Battery-Driven Dynamic Power Management," IEEE Design and Test of Computers, vol. 13, no. 2, pp. 53-60, Mar.-Apr. 2001.
[50] T. Burd and R. W. Brodersen, "Energy Efficient CMOS Microprocessor Design," in Proceedings of the 28th Hawaii International Conference on System Sciences, 1995, pp. 288-297.
[51] J. M. Chang and M. Pedram, "Energy Minimization using Multiple Supply Voltages," IEEE Transactions on VLSI Systems, vol. 5, no. 4, pp. 436-443, Dec. 1997.
[52] J. Pouwelse, K. Langendoen, and H. Sips, "Energy Priority Scheduling for Variable Voltage Processor," in Proceedings of the International Symposium on Low Power Electronics and Design, Aug. 2001, pp. 28-33.
[53] J. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, Inc., Upper Saddle River, NJ, 1996.
[54] S. P. Mohanty, N. Ranganathan, and V. Krishna, "Datapath Scheduling using Dynamic Frequency Clocking," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Apr. 2002, pp. 65-70.
[55] S. P. Mohanty and N. Ranganathan, "Energy Efficient Scheduling for Datapath Synthesis," in Proceedings of the International Conference on VLSI Design, Jan. 2003, pp. 446-451.
[56] N. K. Jha, "Low Power System Scheduling and Synthesis," in Proceedings of the International Conference on Computer Aided Design, 2001, pp. 259-263.
[57] T. L. Martin and D. P. Siewiorek, "Nonideal Battery and Main Memory Effects on CPU Speed-Setting for Low Power," IEEE Transactions on VLSI Systems, vol. 9, no. 1, pp. 29-34, Feb. 2001.
[58] T. Pering, T. Burd, and R.
W. Brodersen, "Voltage Scheduling in the lpARM Microprocessor System," in Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 96-101.
[59] N. Ranganathan, N. Vijaykrishnan, and N. Bhavanishankar, "A Linear Array Processor with Dynamic Frequency Clocking for Image Processing Applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 435-445, August 1998.
[60] N. Ranganathan, N. Vijaykrishnan, and N. Bhavanishankar, "A VLSI Array Architecture with Dynamic Frequency Clocking," in Proceedings of the International Conference on Computer Design, 1996, pp. 137-140.
[61] I. Brynjolfson and Z. Zilic, "FPGA Clock Management for Low Power," in Proceedings of the International Symposium on FPGAs, 2000, pp. 219-219.
[62] I. Brynjolfson and Z. Zilic, "Dynamic Clock Management for Low Power Applications in FPGAs," in Proceedings of the IEEE Custom Integrated Circuits Conference, 2000, pp. 139-142.
[63] J. M. Kim and S. I. Chae, "New MPEG2 Decoder Architecture using Frequency Scaling," in Proceedings of the IEEE International Symposium on Circuits and Systems, 1996, pp. 253-256.
[64] S. P. Mohanty, N. Ranganathan, and S. K. Chappidi, "An ILP-Based Scheduling Scheme for Energy Efficient High Performance Datapath Synthesis," in Proceedings of the International Symposium on Circuits and Systems (Vol. 5), 2003, pp. 313-316.
[65] M. Johnson and K. Roy, "Datapath Scheduling with Multiple Supply Voltages and Level Converters," ACM Transactions on Design Automation of Electronic Systems, vol. 2, no. 3, pp. 227-248, July 1997.
[66] M. Igarashi, K. Usami, K. Nogami, F. Minami, Y. Kawasaki, T. Aoki, M. Takano, S. Sonoda, M. Ichida, and N. Hatanaka, "A low-power design method using multiple supply voltages," in Proceedings of the International Symposium on Low Power Electronics and Design, Aug. 1997, pp. 18-20.
[67] M. Hamada, M. Takahashi, H. Arakida, A. Chiba, T. Terazawa, T. Ishikawa, M.
Kanazawa, M. Igarashi, K. Usami, and T. Kuroda, "A Top-Down Low Power Design Technique Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme," in Proceedings of the 1998 IEEE Custom Integrated Circuits Conference, 1998, pp. 495-498.
[68] K. Usami, M. Igarashi, F. Minami, T. Ishikawa, M. Kanazawa, M. Ichida, and K. Nogami, "Automated low-power technique exploiting multiple supply voltages applied to a media processor," IEEE Journal of Solid-State Circuits, vol. 33, no. 3, pp. 463-472, Mar. 1998.
[69] K. Usami, K. Nogami, M. Igarashi, F. Minami, Y. Kawasaki, T. Ishikawa, M. Kanazawa, T. Aoki, M. Takano, C. Mizuno, M. Ichida, S. Sonoda, M. Takahashi, and N. Hatanaka, "Automated low-power technique exploiting multiple supply voltages applied to a media processor," in Proceedings of the IEEE 1997 Custom Integrated Circuits Conference, May 1997, pp. 131-134.
[70] S. Katzenbeisser and F. A. P. Petitcolas, Information Hiding Techniques for Steganography and Digital Watermarking, Artech House, Inc., MA, USA, 2000.
[71] N. Memon and P. W. Wong, "Protecting Digital Media Content," Communications of the ACM, vol. 41, no. 7, pp. 34-43, Jul. 1998.
[72] S. P. Mohanty, "Watermarking of Digital Images," M.S. thesis, Indian Institute of Science, Bangalore, India, 1999.
[73] G. W. Braudaway, K. A. Magerlein, and F. Mintzer, "Protecting Publicly Available Images with a Visible Image Watermark," in Proceedings of the SPIE Conference on Optical Security and Counterfeit Deterrence Technique (Vol. SPIE-2659), 1996, pp. 126-132.
[74] S. P. Mohanty, K. R. Ramakrishnan, and M. S. Kankanhalli, "A DCT Domain Visible Watermarking Technique for Images," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2000, pp. 1029-1032.
[75] J. Meng and S. F. Chang, "Embedding Visible Video Watermarks in the Compressed Domain," in Proceedings of the International Conference on Image Processing (Vol. 1), 1998, pp. 474-477.
[76] Y. Hu and S.
Kwong, "Wavelet Domain Adaptive Visible Watermarking," IEE Electronics Letters, vol. 37, no. 20, pp. 1219-1220, Sep. 2001.
[77] S. P. Mohanty, K. R. Ramakrishnan, and M. S. Kankanhalli, "An Adaptive DCT Domain Visible Watermarking Technique for Protection of Publicly Available Images," in Proceedings of the International Conference on Multimedia Processing and Systems, 2000, pp. 195-198.
[78] P. Wayner, Disappearing Cryptography: Information Hiding: Steganography and Watermarking, Morgan Kaufmann, CA, USA, 2002.
[79] M. Kankanhalli, Rajmohan, and K. R. Ramakrishnan, "Content Based Watermarking for Images," in Proceedings of the 6th ACM International Multimedia Conference, 1998, pp. 61-70.
[80] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, "Secure Spread Spectrum Watermarking for Multimedia," IEEE Transactions on Image Processing, vol. 6, no. 12, pp. 1673-1687, Dec. 1997.
[81] W. Zhu, Z. Xiong, and Y. Q. Zhang, "Multiresolution Watermarking for Images and Video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 545-550, June 1999.
[82] R. G. Wolfgang and E. J. Delp, "A Watermark for Digital Images," in Proceedings of the IEEE International Conference on Image Processing (Vol. 3), 1996, pp. 219-222.
[83] S. P. Mohanty, K. R. Ramakrishnan, and M. S. Kankanhalli, "A Dual Watermarking Technique for Images," in Proceedings of the 7th ACM International Multimedia Conference (Vol. 2), 1999, pp. 49-51.
[84] J. Fridrich and M. Goljan, "Images with Self-Correcting Capabilities," in Proceedings of the International Conference on Image Processing (Vol. 3), 1999, pp. 792-796.
[85] K. Balakrishnan, "A Dual Voltage and Dual Frequency Low Power VLSI Implementation of DCT Domain Image Watermarking Schemes," M.S. thesis, University of South Florida, Fall, 2003.
[86] M. Johnson and K.
Roy, "Optimal Selection of Supply Voltages and Level Conversions during Datapath Scheduling under Resource Constraints," in Proceedings of the International Conference on Computer Design, Oct. 1996, pp. 72-77.
[87] M. Johnson and K. Roy, "Scheduling and Optimal Voltage Selection for Low Power Multiple-Voltage DSP Datapaths," in Proceedings of the IEEE Symposium on Circuits and Systems (Vol. 3), June 1997, pp. 2152-2155.
[88] J. M. Chang and M. Pedram, "Energy Minimization Using Multiple Supply Voltages," in Proceedings of the International Symposium on Low Power Electronics and Design, 1996, pp. 157-162.
[89] Y. R. Lin, C. T. Hwang, and A. C. H. Wu, "Scheduling Techniques for Variable Voltage Low Power Design," ACM Transactions on Design Automation of Electronic Systems, vol. 2, no. 2, pp. 81-97, Apr. 1997.
[90] M. Sarrafzadeh and S. Raje, "Scheduling with Multiple Voltages under Resource Constraints," in Proceedings of the IEEE Symposium on Circuits and Systems (Vol. 1), 1999, pp. 350-353.
[91] A. Kumar and M. Bayoumi, "Multiple Voltage-Based Scheduling Methodology for Low Power in the High Level Synthesis," in Proceedings of the International Symposium on Circuits and Systems (Vol. 1), July 1999, pp. 371-379.
[92] A. Kumar and M. Bayoumi, "A novel scheduling-based CAD methodology for exploring the design space of ASICs for low power," in Proceedings of the 11th Annual IEEE International ASIC Conference, Sep. 1998, pp. 115-118.
[93] A. Kumar and M. Bayoumi, "A novel scheduling-based CAD methodology for exploring the design space of ASICs for low power," in Proceedings of the 1998 IEEE Asia-Pacific Conference on Circuits and Systems, Nov. 1998, pp. 391-394.
[94] M. A. Elgamel and M. Bayoumi, "On low-power high-level synthesis using genetic algorithms," in Proceedings of the 9th International Conference on Electronics, Circuits and Systems (Vol. 2), Nov. 2002, pp. 725-728.
[95] W. T. Shiue and C.
Chakrabarti, "Low-Power Scheduling with Resources Operating at Multiple Voltages," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 47, no. 6, pp. 536-543, June 2000.
[96] W. T. Shiue and C. Chakrabarti, "Low power scheduling with resources operating at multiple voltages," in Proceedings of the 9th International Symposium on Circuits and Systems (Vol. 2), June 1998, pp. 437-440.
[97] A. Manzak and C. Chakrabarti, "A Low Power Scheduling Scheme with Resources Operating at Multiple Voltages," IEEE Transactions on VLSI Systems, vol. 10, no. 1, pp. 6-14, Feb. 2002.
[98] A. Manzak and C. Chakrabarti, "A Low Power Scheduling Scheme with Resources Operating at Multiple Voltages," in Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (Vol. 1), July 1999, pp. 354-357.
[99] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri, "Profile-driven Behavioral Synthesis for Low Power VLSI System," IEEE Design and Test of Computers, vol. 12, no. 3, pp. 70-84, Fall 1995.
[100] S. Katkoori, N. Kumar, L. Rader, and R. Vemuri, "A profile driven approach for low power synthesis," in Proceedings of the International Conference on Asian and South Pacific Design Automation Conference (ASP-DAC), 1995, pp. 759-765.
[101] A. Raghunathan and N. K. Jha, "SCALP: An Iterative-Improvement Based Low-Power Datapath Synthesis System," IEEE Transactions on CAD of Integrated Circuits and Systems, vol. 16, no. 11, pp. 1260-1277, Nov. 1997.
[102] A. Raghunathan and N. Jha, "Behavioral Synthesis for Low Power," in Proceedings of the International Conference on Computer Design, 1994, pp. 318-322.
[103] S. Bhatia and N. K. Jha, "Behavioral Synthesis for Hierarchical Testability of Controller/Datapath Circuit with Conditional Branches," in Proceedings of the International Conference on Computer Design, Oct. 1994.
[104] L. Y. Chiou, K. Muhammad, and K.
Roy, "DSP data path synthesis for low-power applications," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Vol. 2), 2001, pp. 1165-1168.
[105] K. S. Khouri, G. Lakshminarayana, and N. K. Jha, "High-level synthesis of low-power control-flow intensive circuits," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 18, no. 12, pp. 1715-1729, Dec. 1999.
[106] R. Henning and C. Chakrabarti, "An approach to switching activity consideration during high-level, low-power design space exploration," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 49, no. 5, pp. 339-351, May 2002.
[107] R. Henning and C. Chakrabarti, "Activity models for use in low power high-level synthesis," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Vol. 4), Mar. 1999, pp. 1881-1884.
[108] W. T. Shiue and C. Chakrabarti, "ILP Based Scheme for Low Power Scheduling and Resource Binding," in Proceedings of the IEEE International Symposium on Circuits and Systems (Vol. 3), 2000, pp. 279-282.
[109] M. Lundberg, K. Muhammad, K. Roy, and S. K. Wilson, "High-level modeling of switching activity with application to low-power DSP system synthesis," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Vol. 4), Mar. 1999, pp. 1877-1880.
[110] M. Lundberg, K. Muhammad, K. Roy, and S. K. Wilson, "A novel approach to high-level switching activity modeling with applications to low-power DSP system synthesis," IEEE Transactions on Signal Processing, vol. 49, no. 12, pp. 3157-3167, Dec. 2001.
[111] M. K. Shin and C. H. Lin, "An efficient resource allocation algorithm with minimal power consumption," in Proceedings of the IEEE Region 10 International Conference on Electrical and Electronic Technology (Vol. 2), 2001, pp. 703-706.
[112] J. Rabaey, C. Chu, P. Hoang, and M.
Potkonjak, "Fast Prototyping of Datapath-Intensive Architectures," IEEE Design and Test of Computers, vol. 8, no. 2, pp. 40-51, June 1991.
[113] J. Monteiro, S. Devadas, P. Ashar, and A. Mauskar, "Scheduling Techniques to Enable Power Management," in Proceedings of the ACM/IEEE Design Automation Conference, 1996, pp. 349-352.
[114] R. V. Cherabuddi and M. A. Bayoumi, "A low power based partitioning and binding technique for single chip application specific DSP architectures," in Proceedings of the Second Annual IEEE International Conference on Innovative Systems in Silicon, Oct. 1997, pp. 350-361.
[115] J. S. Lee, H. D. Lee, C. W. Park, and S. Y. Hwang, "Power conscious scheduling algorithm for performance-driven datapath synthesis," IEE Electronics Letters, vol. 32, no. 17, pp. 1574-1576, Aug. 1996.
[116] S. Gupta and S. Katkoori, "Force-directed scheduling for dynamic power optimization," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2002, pp. 68-73.
[117] A. Murugavel and N. Ranganathan, "A Game Theoretic Approach for Binding in Behavioral Synthesis," in Proceedings of the International Conference on VLSI Design, Jan. 2003, pp. 452-458.
[118] R. V. Cherabuddi, M. A. Bayoumi, and H. Krishnamurthy, "A low power based system partitioning and binding technique for multi-chip module architectures," in Proceedings of the 7th Great Lakes Symposium on VLSI, Mar. 1997, pp. 156-162.
[119] W. T. Shiue, "High Level Synthesis for Peak Power Minimization using ILP," in Proceedings of the IEEE International Conference on Application Specific Systems, Architectures and Processors, 2000, pp. 103-112.
[120] W. T. Shiue, "Low Power VLSI Design: Peak Power Minimization using Novel Scheduling Algorithm Based on an ILP Model," in Proceedings of the 10th NASA Symposium on VLSI Design, Mar. 2002.
[121] W. T. Shiue, J. Denison, and A.
Horak, "A Novel Scheduler for Low Power Real Time Systems," in Proceedings of the 43rd Midwest Symposium on Circuits and Systems, Aug. 2000, pp. 312-315.
[122] J. Pouwelse, K. Langendoen, and H. Sips, "Dynamic Voltage Scaling on a Low-Power Microprocessor," in Proceedings of the 7th International Conference on Mobile Computing Network, July 2001.
[123] T. Ishihara and H. Yasuura, "Voltage Scheduling Problem for Dynamic Variable Voltage Processors," in Proceedings of the International Symposium on Low Power Electronics and Design, Aug. 1998, pp. 197-202.
[124] T. Okuma, T. Ishihara, and H. Yasuura, "Real-time task scheduling for a variable voltage processor," in Proceedings of the 12th International Symposium on System Synthesis, Nov. 1999, pp. 24-29.
[125] T. Okuma, H. Yasuura, and T. Ishihara, "Software energy reduction techniques for variable-voltage processors," IEEE Design and Test of Computers, vol. 18, no. 2, pp. 31-41, Mar.-Apr. 2001.
[126] I. Hong, M. Potkonjak, and M. B. Srivastava, "On-line scheduling of hard real-time tasks on variable voltage processor," in Proceedings of the IEEE/ACM International Conference on Computer Aided Design, Nov. 1998, pp. 653-656.
[127] I. Hong, D. Kirovski, G. Qu, M. Potkonjak, and M. B. Srivastava, "Power optimization of variable-voltage core-based systems," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 18, no. 12, pp. 1702-1714, Dec. 1999.
[128] M. M. Mansour, M. M. Mansour, I. Hajj, and N. Shanbhag, "Instruction Scheduling for Low Power on Dynamically Variable Voltage Processors," in Proceedings of the 7th IEEE International Conference on Electronics, Circuits and Systems, 2000, pp. 613-618.
[129] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, "Profile-based dynamic voltage scheduling using program checkpoint," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2002, pp. 168-175.
[130] A. Aze v edo, R. Cornea, I. Issenin R. Gupta, N. Dutt, A. Nicolau, and A. V eidenbaum, Ar chitectural and compiler strate gies for dynamic po wer management in the COPPER project in Pr oceedings of the International W orkshop on Inno vative Ar c hitectur e for Futur e Gener ation HighP erformance Pr ocessor s and Systems 2001, pp. 25 34. [131] V Sw aminathan and K. Chakrabarty In v estigating the ef fect of v oltageswitching on lo wener gy task scheduling in hard realtime systems, in Pr oceedings of the Asia and South P acic Design A utomation Confer ence 2001, pp. 251254. [132] V Sw aminathan and K. Chakrabarty Pruningbased ener gyoptimal de vice scheduling for hard realtime system, in Pr oceedings of the T enth International Symposium on Har dwar e / Softwar e Codesign 2002, pp. 175180. [133] C. H. Hsu, U. Kremer and M. Hsiao, Compiler Directed Dynamic Frequenc y and V oltage Scheduling, in Pr oceedings of the W orkshop on P ower A war e Computer Systems No v 2000, pp. 6581. [134] C. H. Hsu, U. Kremer and M. Hsiao, Compiler Directed Dynamic V oltage/Frequenc y Scheduling for Ener gy Reduction in Microprocessors, T ech. Rep., Departament of Computer Science, Rutgers Uni v ersity 2001. 267 PAGE 285 [135] Y H. Lee and C. M. Krishna, V oltageClock Scaling for Lo w Ener gy Consumption in RealT ime Embedded Systems, in Pr oceedings of the 6th International Confer ence on RealT ime Computing Systems and Applications 1999, pp. 272279. [136] F Y ao, A. Demers, and S. Shenk er A scheduling model for reduced CPU ener gy, in Pr oceedings of the 36th Annual Symposium on F oundations of Computer Science Oct 1995, pp. 374382. [137] J. Luo and N. K. Jha, Po wer prole dri v en v ariable v oltage scaling for heterogeneous distrib uted realtime embedded systems, in Pr oceedings of the 16th International Confer ence on VLSI Design 2003, pp. 369375. [138] J. Luo and N. K. 
Jha, Static and dynamic v ariable v oltage scheduling algorithms for realtime heterogeneous distrib uted embedded systems, in Pr oceedings of the 15th International Confer ence on VLSI Design 2002, pp. 719726. [139] J. Luo, S. Peh, and N. K. Jha, Simultaneous dynamic v oltage scaling of processors and communication links in realtime distrib uted embedded systems in Pr oceedings of the Design, A utomation and T est in Eur ope Confer ence and Exhibition 2003, pp. 11501151. [140] N. V ijaykrishnan, N. Ranganathan, and N. Bha v anishankar DFLAP : A Dynamic Frequenc y Linear Array Processor in Pr oceedings of the International Confer ence on Ima g e Pr ocessing 1996, pp. 10071010. [141] N. V ijaykrishnan, N. Ranganathan, and N. Bha v anishankar A Dynamic Frequenc y Linear Array Processor for Image Processing, in Pr oceedings of the International Confer ence on P attern Reco gnition 1996, pp. 611615. [142] V Krishna, N. Ranganathan, and N. V ijaykrishnan, Ener gy Ef cient Datapath Synthesis using Dynamic Frequenc y Clocking and Multiple V oltages, in Pr oceedings of the International Confer ence on VLSI 1999, pp. 440445. [143] V Krishna, N. Ranganathan, and N. V ijaykrishnan, An Ener gy Ef cient Scheduling Scheme for Signal Processing Applications, in Pr oceedings of the thirtysecond Asilomar Confer ence on Signal, Systems and Computer s (V ol. 2) 1998, pp. 10571061. [144] C. P apachristou, M. Spining, and M. Nourani, A Multiple Clocking Scheme for Lo w Po wer R TL Design, IEEE T r ansactions on VLSI Systems v ol. 7, no. 2, pp. 266276, June 1999. [145] T Burd, T Pering, A. Stratak os, and R. W Brodersen, A Dynamic V oltage Scaled Microprocessor System, in Pr oceedings of the IEEE International SolidState Cir cuits Confer ence Feb 2000, pp. 294295. [146] T Burd, T A. Pering, A. J. Stratak os, and R. W Brodersen, A Dynamic V oltage Scaled Microprocessor System, IEEE J ournal of SolidState Cir cuits v ol. 35, no. 11, pp. 1571 1580, No v 2000. [147] A. Acqua vi v a, L. 
Benini, and B. Ricco, Processor frequenc y setting for ener gy minimization of streaming multimedia application, in Pr oceedings of the 9th International Symposium on Har dwar e / Softwar e Codesign 2001, pp. 249253. 268 PAGE 286 [148] L. Benini, E. Macii, M. Pnocino, and G. De Micheli, T elescopic Units : A Ne w P aradigm for Performance Optimization of VLSI Design, IEEE T r ansactions on Computer Aided Design of Inte gr ated Cir cuits and Systems v ol. 17, no. 3, pp. 220232, Mar 1998. [149] L. Benini, G. De Micheli, A. Lio y E. Macii, G. Odasso, and M. Poncino, Automatic Synthesis of Lar ge T elescopic Units Based on Near Minimum T imed Supersetting, IEEE T r ansactions on Computer s v ol. 48, no. 8, pp. 769779, Aug 1999. [150] V Raghunathan, S. Ra vi, and G. Lakshminarayana, HighLe v el Synthesis with V ariableLatenc y Components, in Pr oceedings of the International Confer ence on VLSI Design Jan 2000, pp. 220227. [151] K. J. No wka, G. D. Carpenter E. W MacDonald, H. C. Ngo, B. C. Brock, K. I. Ishii, T Y Nguyen, and J. L. Burns, A 32bit po werPC systemonachip with support for dynamic v oltage scaling and dynamic frequenc y scaling, IEEE J ournal of SolidState Cir cuits v ol. 37, no. 11, pp. 14411447, No v 2002. [152] K. No wka, G. Carpenter E. MacDonald, H. Ngo, B. Brock, K. Ishii, T Nguyen, and J. Burns, A 0.9 V to 1.95 V dynamic v oltagescalable and frequenc yscalable 32 b Po werPC processor in Pr oceedings of the International SolidState Cir cuits Confer ence (V ol. 1) 2002, pp. 340341. [153] Y H. Lu, L. Benini, and G. De Micheli, Dynamic frequenc y scaling with b uf fer insertion for mix ed w orkloads, IEEE T r ansactions on Computer Aided Design of Inte gr ated Cir cuits and System v ol. 21, no. 11, pp. 12841305, No v 2002. [154] T Pering, T Burd, and R. W Brodersen, Dynamic V oltage Scaling and the Design of a Lo wPo wer Microprocessor System, in Pr oceedings of the W orkshop on P ower Driven Micr oar c hitectur e June 1998. [155] T Burd and R. 
W Brodersen, Design Issues for Dynamic V oltage Scaling, in Pr oceedings of the International Symposium on Low P ower Electr onics and Design 2000, pp. 914. [156] S. Hassoun and C. Ebeling, Architectural Retiming : Pipelining Latenc y Constrained Cir cuits, in Pr oceedings of the 33r d A CM / IEEE Design A utomation Confer ence 1996, pp. 708713. [157] S. No wick, Design of a lo wlatenc y asynchronous adder using speculati v e completion, IEE Pr oceedings on Computer Digital T ec hniques v ol. 143, no. 9, pp. 301307, Sep 1996. [158] L. D. Stryck er P T ermont, J. V ande we ge, J. Haitsma, A. Kalk er M. Maes, and G. Depo v ere, Implementation of a RealT ime Digital W atermarking Process for Broadcast Monitoring on T rimedia VLIW Processor, IEE Pr oceedings on V ision, Ima g e and Signal Pr ocessing v ol. 147, no. 4, pp. 371376, Aug 2000. [159] N. J. Mathai, D. K undur and A. Sheikholeslami, Hardw are Implementation Perspecti v es of Digital V ideo W atermarking Algortithms, IEEE T r ansanctions on Signal Pr ocessing 2003. 269 PAGE 287 [160] T H. Tsai and C. Y Lu, A System Le v el Design for Embedded W atermark T echnique using DSC System, in Pr oceedings of the IEEE International W orkshop on Intellig ent Signal Pr ocessing and Communication System 2001. [161] A. Garimella, M. V V Satyanarayan, R. S. K umar P S. Murugesh, and U. C. Niranjan, VLSI Impementation of Online Digital W atermarking T echniques W ith Dif ference Encoding for the 8bit Gray Scale Images, in Pr oceedings of the International Confer ence on VLSI Design 2003, pp. 792796. [162] A. Antola, V Piuri, and M. Sami, A Lo wRedundanc y Approach to SemiConcurrent Error Detection in Datapaths, in Pr oceedings of the Design A utomation and T est in Eur ope 1998, pp. 266272. [163] P K ollig and B. M. AlHashimi, Simultaneous Scheduling, Allocation and Binding in High Le v el Synthesis, IEE Electr onics Letter s v ol. 33, no. 18, pp. 15161518, Aug. 1997. [164] G. Fetweis, J. Chiu, and B. 
Fraenk el, A Lo wComple xity BitSerial DCT/IDCT Architecture, in Pr oceedings of the IEEE International Confer ence on Communications 1993, pp. 217221. [165] K. Balakrishnan, Peak Po wer Minimization through Datapath Scheduling using ILP Based Models, M.S. thesis, Uni v ersity of South Florida, Spring, 2003. [166] R. F ourer D. Gay and B. K ernighan, AMPL: A Modeling Langua g e for Mathematical Pr o gr amming Thomson Brooks Cole, 2003. [167] P E. Landman and J. M. Rabae y Architectural Po wer Analysis : The Dual Bit T ype Method, IEEE T r ansactions on VLSI Systems v ol. 3, no. 2, pp. 173187, Jun 1995. [168] J. H. Satyanarayan and K. K. P arhi, Theoritical Analysis of W ordLe v el Switching Acti vity in the Presence of Glitch and Correlation, IEEE T r ansactions on VLSI Systems v ol. 8, no. 2, pp. 148159, Apr 2000. [169] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, Analytical Estimation of Signal T ransition Acti vity from W ordLe v el Statistics, IEEE T r ansactions on CAD of Inte gr ated Cir cuits and Systems v ol. 16, no. 7, pp. 718733, Jul 1997. [170] S. P Mohanty N. Rangnathan, and S. K. Chappidi, ILP Models for Ener gy and T ransient Po wer Minimization During Beha vioral Synthesis, in Pr oceedings of the 17th International Confer ence on VLSI Design 2004, p. to appear [171] S. P Mohanty N. Rangnathan, and S. K. Chappidi, T ransient Po wer Minimization Through Datapath Scheduling in Multiple Supply V oltage En vironment, in Pr oceedings of the 10th IEEE International Confer ence on Electr onics, Cir cuits and Systems 2003, p. to appear [172] S. S. Rao, Engineering Optimization : Theory and Pr actice AddisonW esle y 1996. [173] M. J. P anik, Linear Pr o gr amming : Mathematics, Theory and Pr actice Kluwer Academic Publishers, 1996. 270 PAGE 288 [174] B. A. McCarl and T H. Spreen, Applied Mathematical Pr o gr amming using Alg ebric Systems Online Book at : http://agecon.tamu.edu/ f ac ul ty/ mccar l/regbo ok .htm, 1997. [175] S. P Mohanty N. Rangnathan, and S. 
K. Chappidi, Po wer Fluctuation Minimization During Beha vioral Synthesis using ILPBased Datapath Scheduling, in Pr oceedings of the 21st IEEE International Confer ence on Computer Design 2003, p. to appear [176] S. P Mohanty N. Ranganathan, and R. K. Namballa, VLSI Implementation of In visible Digital W atermarking Algorithms T o w ards the De v elopement of a Secure JPEG Encoder, in Pr oceedings of the IEEE W orkshop on Signal Pr ocessing Systems 2003, pp. 183188. [177] A. T ef as and I. Pitas, Rob ust Spatial Image W atermarking Using Progressi v e Detection, in Pr oceedings of the IEEE International Confer ence on Acoustics, Speec h, and Signal Pr ocessing (V ol. 3) 2001, pp. 19731976. [178] F Bartolini, M. Barni A. T ef as, and I. Pitas, Image authentication techniques for surv eillance applications, Pr oceedings of the IEEE v ol. 89, no. 10, pp. 14031418, Oct 2001. [179] S. P Mohanty N. Rangnathan, and R. K. Namballa, VLSI Implementation of V isible W atermarking for a Secure Digital Still Camera (SCDC) Design, in Pr oceedings of the 17th International Confer ence on VLSI Design 2004, p. to appear [180] V P Nelson, H. T Nagle, J. D. Irwin, and B. D. Caroll, Digial Lo gic Analysis and Design Prentice Hall, Upper Saddle Ri v er Ne w Jerse y USA, 1995. [181] N. H. E. W este and K. Eshraghian, Principles of CMOS VLSI Design : A Systems P er spective Addison W esle y Boston, MA, USA, 1999. [182] I. J. Cox, J. Kilian, T Leighton, and T Shamoon, Secure Spread Spectrum W atermarking of Images, Audio and V ideo, in Pr oceedings of the IEEE International Confer ence on Ima g e Pr ocessing (V ol. 3) 1996, pp. 243246. [183] I.J.Cox, A secure rob ust w atermarking for multimedia, in Pr oc. of F ir st International W orkshop on Information Hiding 1996, v ol. 1174, pp. 185206. [184] M. Kaul, R. V emuri, S. Go vindarajan, and I. 
ABOUT THE AUTHOR

Saraju P. Mohanty received the Bachelor of Technology degree in Electrical Engineering in 1995 from the College of Engineering and Technology, Orissa University of Agriculture and Technology, Bhubaneswar, India. He received the Master of Engineering degree in Systems Science and Automation from the Indian Institute of Science, Bangalore, India, in 1999. He has taught several courses as an instructor at the Department of Computer Science and Engineering, University of South Florida, USA, and also at the College of Engineering and Technology, Orissa University of Agriculture and Technology, Bhubaneswar, India. He has published several research papers in the areas of VLSI design automation, VLSI design, and digital watermarking. His paper was nominated for the best paper award at the International Conference on VLSI Design 2003. In 2002 and 2003, he received a certificate of recognition from the Provost, University of South Florida, for outstanding teaching.
His research interests are high-level synthesis for low power, low-power VLSI design for multimedia applications, computer architecture, and digital watermarking. He is a member of the IEEE-CS and ACM-SIGDA.

Title: Energy and transient power minimization during behavioral synthesis [electronic resource] / by Saraju P. Mohanty.
Author: Mohanty, Saraju P.
Publication: [Tampa, Fla.] : University of South Florida, 2003.
Thesis: Thesis (Ph.D.), University of South Florida, 2003.
Identifiers: (OCoLC)53404407; SFE0000129. Call number: TK7885.
Notes: Includes bibliographical references. Includes vita. Text (Electronic thesis) in PDF format. System requirements: World Wide Web browser and PDF reader. Mode of access: World Wide Web. Title from PDF of title page. Document formatted into pages; contains 289 pages.

ABSTRACT: The proliferation of portable systems and mobile computing platforms has increased the need for integrated circuits that consume little power. The increase in chip density and clock frequencies due to technology advances has made low-power design a critical issue. Low-power design is further driven by several other factors, such as thermal considerations and environmental concerns. In low-power design for battery-driven portable applications, the reduction of peak power, peak power differential, average power and energy are equally important. In this dissertation, we propose a framework for the reduction of these parameters through datapath scheduling at the behavioral level.
Several ILP-based and heuristic-based scheduling schemes are developed for datapath synthesis assuming: (i) single supply voltage and single frequency (SVSF), (ii) multiple supply voltages and dynamic frequency clocking (MVDFC), and (iii) multiple supply voltages and multicycling (MVMC). The scheduling schemes attempt to minimize: (i) energy, (ii) energy-delay product, (iii) peak power, (iv) simultaneous peak power and average power, (v) simultaneous peak power, average power, peak power differential and energy, and (vi) power fluctuation. A new parameter called the "Cycle Power Function" (CPF) is defined, which captures the transient power characteristics as the equally weighted sum of the normalized mean cycle power and the normalized mean cycle differential power. Minimizing this parameter using multiple supply voltages and dynamic frequency clocking results in the reduction of both energy and transient power. The cycle differential power can be modeled either as the absolute deviation from the average power or as the cycle-to-cycle power gradient. The switching activity information is obtained from behavioral simulations. Power fluctuation is modeled as the cycle-to-cycle power gradient, and to reduce fluctuation the mean power gradient (MPG) is minimized. The power models take into consideration the effect of switching activity on the power consumption of the functional units. Experimental results for selected high-level synthesis benchmark circuits under different constraints indicate that significant reductions in power, energy and energy-delay product can be obtained, and that the MVDFC and MVMC schemes yield better power reduction than the SVSF scheme.

Several application-specific VLSI circuits were designed and implemented for digital watermarking of images. Digital watermarking is the process that embeds data called a watermark into a multimedia object such that the watermark can be detected or extracted later to make an assertion about the object.
A class of VLSI architectures is proposed for various watermarking algorithms: (i) spatial domain invisible-robust watermarking scheme, (ii) spatial domain invisible-fragile watermarking scheme, (iii) spatial domain visible watermarking scheme, (iv) DCT domain invisible-robust watermarking scheme, and (v) DCT domain visible watermarking scheme. Prototype implementations of (i), (ii) and (iii) are given. The hardware modules can be incorporated in a "JPEG encoder" or in a "digital still camera".

Adviser: Ranganathan, N.
Keywords: dynamic frequency clocking, multiple supply voltages, low power synthesis, datapath scheduling, average power, peak power, power fluctuation, multicycling, digital watermarking.
Subject: Dissertations, Academic; USF; Computer Science and Engineering; Doctoral.
Host item: USF Electronic Theses and Dissertations.
URL: http://digital.lib.usf.edu/?e14.129
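The transient power metrics named in the abstract can be illustrated with a small sketch. It computes the cycle differential power under both models mentioned there (absolute deviation from the average, and cycle-to-cycle gradient), the mean power gradient (MPG), and an equally weighted CPF. The abstract does not say how the two means are normalized; dividing each mean by its own peak value over the schedule is an assumption made here for illustration, and the function names are hypothetical rather than taken from the dissertation.

```python
from statistics import mean


def cycle_differential_power(cycle_power, model="deviation"):
    """Per-cycle differential power under one of the two models in the abstract."""
    if model == "deviation":
        # Absolute deviation of each cycle's power from the average power.
        avg = mean(cycle_power)
        return [abs(p - avg) for p in cycle_power]
    if model == "gradient":
        # Absolute cycle-to-cycle change in power (empty for a single cycle).
        return [abs(b - a) for a, b in zip(cycle_power, cycle_power[1:])]
    raise ValueError("model must be 'deviation' or 'gradient'")


def mean_power_gradient(cycle_power):
    """MPG: mean of the cycle-to-cycle power gradient; 0.0 if there is one cycle."""
    diffs = cycle_differential_power(cycle_power, "gradient")
    return mean(diffs) if diffs else 0.0


def cycle_power_function(cycle_power, model="deviation"):
    """Equally weighted sum of normalized mean power and normalized mean differential.

    ASSUMPTION: each mean is normalized by its peak value over the schedule.
    """
    if not cycle_power or max(cycle_power) <= 0:
        raise ValueError("need at least one cycle with positive power")
    diffs = cycle_differential_power(cycle_power, model)
    p_norm = mean(cycle_power) / max(cycle_power)
    peak_diff = max(diffs) if diffs else 0.0
    dp_norm = mean(diffs) / peak_diff if peak_diff > 0 else 0.0
    return p_norm + dp_norm


# Two schedules with the same average power (15 units per cycle).
spiky = [10.0, 20.0, 10.0, 20.0]
flat = [15.0, 15.0, 15.0, 15.0]
print(cycle_power_function(spiky))  # 1.75
print(cycle_power_function(flat))   # 1.0
print(mean_power_gradient(spiky))   # 10.0
```

Under this assumed normalization, the flat profile scores lower than the spiky one even though both draw the same average power, which is the kind of behavior a scheduler minimizing CPF or MPG would favor.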