MARC 21 record

Leader: nam 2200397Ka 4500
001 002069488
005 20100422133438.0
007 cr mnuuuuuu
008 100422s2009 flu s 000 0 eng d
024 (ind1=8) $a E14SFE0003280
035 $a (OCoLC)608497446
040 $a FHM $c FHM
049 $a FHMM
090 $a TK7885 (Online)
100 (ind1=1) $a Bhattacharya, Koustav.
245 (ind1=0) $a Architectures and algorithms for mitigation of soft errors in nanoscale VLSI circuits $h [electronic resource] / by Koustav Bhattacharya.
260 $a [Tampa, Fla] : $b University of South Florida, 2009.
500 $a Title from PDF of title page. Document formatted into pages; contains 115 pages. Includes vita.
502 $a Dissertation (Ph.D.)--University of South Florida, 2009.
504 $a Includes bibliographical references.
516 $a Text (Electronic dissertation) in PDF format.
520 $a ABSTRACT: The occurrence of transient faults like soft errors in computer circuits poses a significant challenge to the reliability of computer systems. Soft errors, which occur when energetic neutrons coming from space or alpha particles arising out of packaging materials hit the transistors, may manifest themselves as a bit flip in a memory element or as a transient glitch generated at any internal node of combinational logic, which may subsequently propagate to and be captured in a latch. Although the problem of soft errors was earlier a concern only for space applications, aggressive technology scaling trends have exacerbated the problem for modern VLSI systems, even in terrestrial applications. In this dissertation, we explore techniques at all levels of the design flow to reduce the vulnerability of VLSI systems to soft errors without compromising on other design metrics like delay, area and power. We propose new models for estimating soft errors for storage structures and combinational logic. While soft errors in caches are estimated using the vulnerability metric, soft errors in logic circuits are estimated using two new metrics called the glitch enabling probability (GEP) and the cumulative probability of observability (CPO). These metrics, based on signal probabilities of nets, accurately model soft errors in radiation-aware synthesis algorithms and help in efficient exploration of the design solution space during optimization. At the physical design level, we leverage the use of larger netlengths to provide larger RC ladders for effectively filtering out the transient glitches. Towards this, a new heuristic has been developed to selectively assign larger wirelengths to certain critical nets. This reduces the delay and area overhead while improving the immunity to soft errors. Based on this, we propose two placement algorithms based on simulated annealing and quadratic programming which significantly reduce the soft error rates of circuits. At the circuit level, we develop techniques for hardening circuit nodes using a novel radiation jammer technique. The proposed technique is based on the principles of an RC differentiator and is used to isolate the driven cell from the driving cell which is being hit by a radiation strike. Since the blind insertion of radiation blocker cells on all circuit nodes is expensive, candidate nodes are selected for insertion of these cells using a new metric called the probability of radiation blocker circuit insertion (PRI). We investigate a gate sizing algorithm, at the logic level, in which we simultaneously optimize both the soft error rate (SER) and the crosstalk noise, besides the power and performance of circuits, while considering the effect of process variations. The reliability-centric gate sizing technique has been formulated as a mathematical program and is efficiently solved. At the architectural level, we develop solutions for the correction of multibit errors in large L2 caches by controlling or mining the redundancy in the memory hierarchy, and methods to increase the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement policy, in which the amount of redundancy is controlled using a user-defined redundancy threshold. The novel architectures and the new reliability-centric synthesis algorithms proposed for the various design abstraction levels have been shown to achieve significant reduction of soft error rates in current nanometer circuits. The design techniques, algorithms and architectures can be integrated into existing design flows. A VLSI system implementation can leverage the architectural solutions for the reliability of the caches, while the custom hardware synthesized for the VLSI system can be protected against radiation strikes by utilizing the circuit level, logic level and layout level optimization algorithms that have been developed.
538 $a Mode of access: World Wide Web. System requirements: World Wide Web browser and PDF reader.
590 $a Advisor: Nagarajan Ranganathan, Ph.D.
653 $a Transient faults; Design flow; VLSI CAD; Reliable architecture design; Reliability-aware design automation
690 $a Dissertations, Academic $z USF $x Computer Science and Engineering Doctoral.
773 $t USF Electronic Theses and Dissertations.
856 (ind1=4) $u http://digital.lib.usf.edu/?e14.3280

Architectures and Algorithms for Mitigation of Soft Errors in Nanoscale VLSI Circuits

by

Koustav Bhattacharya

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Nagarajan Ranganathan, Ph.D.
Srinivas Katkoori, Ph.D.
Hao Zheng, Ph.D.
Sanjukta Bhanja, Ph.D.
Kandethody M. Ramachandran, Ph.D.
Date of Approval: October 22, 2009

Keywords: Transient Faults; Design Flow; VLSI CAD; Reliable Architecture Design; Reliability-aware Design Automation

Copyright 2009, Koustav Bhattacharya

DEDICATION

To my mother who lives in my heart

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank Professor N. Ranganathan for providing me the opportunity to work with him. He introduced me to various interesting problems and I am extremely fortunate to have worked with such a distinguished scholar and eminent researcher. He has always guided me with valuable suggestions whenever I was in need. His continuous encouragement and emotional support during my difficult times have been instrumental in shaping me to become a better researcher and, more importantly, a better person. I would also like to thank Professor Srinivas Katkoori, Professor Hao Zheng, Professor Sanjukta Bhanja and Professor Kandethody M. Ramachandran for their time in reviewing this manuscript and their valuable suggestions for improving its quality. I am also thankful to Semiconductor Research Corporation (SRC) for supporting this research, in part, by a grant under the contract 2007HJ1596 and to National Science Foundation (NSF) Computing Research Infrastructure, in part, by a grant under the contract CNS0551621. My peers and friends from the lab, especially Soumyaroop, Ziad, Mahalingam, Himanshu and Ransford, have made the years spent in pursuing this degree feel shorter. Their constructive ideas and discussions have been extremely useful in improving the quality of this research. On a personal front, I would like to thank my dad, who at an early age ignited my mind with the will to succeed.
My late mother had always been a constant source of inspiration for my work and instilled in me the attitude that nothing is impossible with strong determination and willpower. I would also like to thank my wife, Amrita, for her support during my tough times.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Contributions and Significance
  1.3 Outline of Dissertation
CHAPTER 2 RELATED WORK
  2.1 Previous Work
  2.2 Dissertation Context in Light of Past Works
CHAPTER 3 ESTIMATION MODELS
  3.1 Background
  3.2 Metrics to Estimate Soft Error Masking Effects
    3.2.1 Estimating Logical Masking Effects
    3.2.2 Estimating Electrical Masking Effects
    3.2.3 Estimating Timing Window Masking Effects
    3.2.4 Estimation of Timing Slack
CHAPTER 4 RADIATION IMMUNITY AT PHYSICAL DESIGN LEVEL
  4.1 Glitch Filtering in Interconnects
  4.2 Placement for Radiation Immunity using Simulated Annealing
  4.3 Fast SER Aware Placement using Quadratic Programming
  4.4 Experimental Results
CHAPTER 5 SOFT ERROR MITIGATION AT CIRCUIT LEVEL
  5.1 Radiation Induced Glitch Blocker Circuit
  5.2 Selective Insertion Algorithm
  5.3 Experimental Results
  5.4 Comparison with Related Works
CHAPTER 6 LOGIC LEVEL RELIABILITY CENTRIC GATE SIZING
  6.1 Interaction of Various Noise Sources under Process Uncertainty
  6.2 Logic Level Modeling of the Design Metrics
    6.2.1 First Order Modeling of Glitch Masking Effects
    6.2.2 Crosstalk Noise Modeling at the Logic Level
    6.2.3 Power and Timing Models
  6.3 Gate Sizing Formulation
  6.4 Experimental Results
CHAPTER 7 SOFT ERROR TOLERANCE AT ARCHITECTURAL LEVEL
  7.1 Characterization of Multibit Errors in Conventional Caches
    7.1.1 Probabilistic Characterization of Multibit Error Rate
    7.1.2 Vulnerability of Conventional Cache Organizations
  7.2 Redundancy-based Error Protection
    7.2.1 Exploiting L1/L2 Redundancy
    7.2.2 Fine Grain Dirtiness
  7.3 Improving Reliability by Controlling Redundancy
    7.3.1 Reliability-centric Replacement Policy
    7.3.2 Exploiting Small Data Value Size
  7.4 Experimental Setup and Results
    7.4.1 Experimental Setup
    7.4.2 Simulation Results
  7.5 Comparison with Related Works
CHAPTER 8 CONCLUSIONS
REFERENCES
LIST OF PUBLICATIONS
ABOUT THE AUTHOR

LIST OF TABLES

Table 4.1 Simulated Annealing Parameters
Table 4.2 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.99 and 0.01
Table 4.3 SER Reduction for Radiation Immune SA Based Placement with Associated Delay and Area Overhead
Table 4.4 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9 and 0.1
Table 4.5 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9999 and 0.0001
Table 5.1 Overhead of Adding Radiation Blocker Circuit for Various Standard Cells
Table 5.2 Experimental Results for ISCAS'85 Benchmark Circuits
Table 6.1 Experimental Results on Benchmark Circuits
Table 7.1 Description of the Schemes Used in Experiments
Table 7.2 Baseline Processor Configuration
Table 7.3 Comparison with Recent Works in Literature

LIST OF FIGURES

Figure 1.1 Impact of SER due to Gate Length Scaling in nm (Source Intel) [8]
Figure 1.2 Design Flow for Soft Error Tolerance in VLSI Systems
Figure 1.3 List of Contributions
Figure 2.1 Taxonomy Diagram: Works Related to Transient Faults on Caches
Figure 2.2 Soft Error Protection Using Circuit Level Optimizations
Figure 2.3 Taxonomy Diagram of Works in Gate Sizing Based on Optimization Metrics
Figure 3.1 Masking in Logic Circuits
Figure 3.2 Illustrating Computation of Logic Observability: (A) Signal Probabilities of Nets, (B) GEP Values at Internal Gate Inputs, (C) Logical Observability Values at Various Gate Outputs
Figure 3.3 NRC for an Inverter at Varying Capacitive Loads [59]
Figure 4.1 Interconnects Modeled as RC Ladder
Figure 4.2 Effect of Wirelengths on Glitch Reduction
Figure 4.3 Cost Function with Penalty for High Area and Total Wirelength. (Note: All values are in generic units)
Figure 4.4 Placement of c432 Benchmark Using QP (A) w1=0.1 w2=0.9, (B) w1=0.9 w2=0.1
Figure 4.5 Placement of c432 Benchmark Using QP (A) w1=0.01 w2=0.99, (B) w1=0.99 w2=0.01
Figure 4.6 Area Comparison for Different Placement Schemes. (Note: All values are in generic units)
Figure 4.7 Wirelength Comparison for Different Placement Schemes. (Note: All values are in generic units)
Figure 4.8 Placement of c432 Benchmark Using QP (A) w1=0.0001 w2=0.9999, (B) w1=0.9999 w2=0.0001
Figure 4.9 Speedup Comparison of QP Based and SA Based Radiation Immune Placement Schemes
Figure 5.1 Schematic of Radiation Induced Glitch Blocker Circuit
Figure 5.2 Plotting the Voltages across M1
Figure 5.3 (A) Transient Pulses on Inverter Cell for Radiation Strikes of Varying Strength, (B) Corresponding Results on an Inverter Cell Protected with Radiation Blocker Circuit
Figure 5.4 Simulation Flow: SER Reduction by Using Radiation Blocker Circuits
Figure 5.5 Layout of the Radiation Blocker Circuitry
Figure 5.6 Comparison of SER Reduction for Different User Defined Parameters
Figure 5.7 Comparison of Delay Overhead for Different User Defined Parameters
Figure 5.8 Comparison of Area Overhead for Different User Defined Parameters
Figure 6.1 Interaction of Soft Errors and Crosstalk Noise
Figure 6.2 Soft Error Mitigation under Process Uncertainty
Figure 6.3 First Order Model on Soft Errors of Logic Circuits with Varying Gate Size
Figure 6.4 Modeling Crosstalk Noise using Graph Clustering based on Rent's Exponent Values
Figure 6.5 Simulation Flow for Reliability-centric Gate Sizing Under Process Variations
Figure 6.6 Average Timing Yield at Different Timing Margins
Figure 6.7 SER Reduction for ISCAS85 benchmarks
Figure 6.8 Improvement in SER, Crosstalk Noise and Power
Figure 7.1 Vulnerability of Different Cache Organizations for SPECINT2000
Figure 7.2 Vulnerability of Different Cache Organizations for SPECFP2000
Figure 7.3 Illustrating Inclusion Property and Fine Grain Dirtiness
Figure 7.4 Illustrating Reliability-centric Replacement and Small Value Duplication
Figure 7.5 Hardware Architecture for Small Value Detection and Duplication
Figure 7.6 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECINT2000
Figure 7.7 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECFP2000
Figure 7.8 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECINT2000
Figure 7.9 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECFP2000
Figure 7.10 IPCs for Various Schemes Proposed for SPECINT2000
Figure 7.11 IPCs for Various Schemes Proposed for SPECFP2000
Figure 7.12 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECINT2000
Figure 7.13 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECFP2000
Figure 7.14 Area Overhead for a L2 Cache with Redundancy Based Error Protection Compared to a Baseline L2 Cache without Error Protection
Figure 7.15 Average Dynamic Power Consumption for a L2 Cache with a 8KB ECC Cache Compared with Baseline L2 Cache without Error Protection
Figure 7.16 Average Leakage Power Consumption for a L2 Cache with Small ECC Cache with Fixed Number of Blocks for Different Sized Multibit Errors

ARCHITECTURES AND ALGORITHMS FOR MITIGATION OF SOFT ERRORS IN NANOSCALE VLSI CIRCUITS

Koustav Bhattacharya

ABSTRACT

The occurrence of transient faults like soft errors in computer circuits poses a significant challenge to the reliability of computer systems.
Soft errors, which occur when energetic neutrons coming from space or alpha particles arising out of packaging materials hit the transistors, may manifest themselves as a bit flip in a memory element or as a transient glitch generated at any internal node of combinational logic, which may subsequently propagate to and be captured in a latch. Although the problem of soft errors was earlier a concern only for space applications, aggressive technology scaling trends have exacerbated the problem for modern VLSI systems, even in terrestrial applications. In this dissertation, we explore techniques at all levels of the design flow to reduce the vulnerability of VLSI systems to soft errors without compromising on other design metrics like delay, area and power.

We propose new models for estimating soft errors for storage structures and combinational logic. While soft errors in caches are estimated using the vulnerability metric, soft errors in logic circuits are estimated using two new metrics called the glitch enabling probability (GEP) and the cumulative probability of observability (CPO). These metrics, based on signal probabilities of nets, accurately model soft errors in radiation-aware synthesis algorithms and help in efficient exploration of the design solution space during optimization.

At the physical design level, we leverage the use of larger netlengths to provide larger RC ladders for effectively filtering out the transient glitches. Towards this, a new heuristic has been developed to selectively assign larger wirelengths to certain critical nets. This reduces the delay and area overhead while improving the immunity to soft errors. Based on this, we propose two placement algorithms based on simulated annealing and quadratic programming which significantly reduce the soft error rates of circuits.

At the circuit level, we develop techniques for hardening circuit nodes using a novel radiation jammer technique. The proposed technique is based on the principles of an RC differentiator and is used to isolate the driven cell from the driving cell which is being hit by a radiation strike. Since the blind insertion of radiation blocker cells on all circuit nodes is expensive, candidate nodes are selected for insertion of these cells using a new metric called the probability of radiation blocker circuit insertion (PRI).

We investigate a gate sizing algorithm, at the logic level, in which we simultaneously optimize both the soft error rate (SER) and the crosstalk noise, besides the power and performance of circuits, while considering the effect of process variations. The reliability-centric gate sizing technique has been formulated as a mathematical program and is efficiently solved.

At the architectural level, we develop solutions for the correction of multibit errors in large L2 caches by controlling or mining the redundancy in the memory hierarchy, and methods to increase the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement policy, in which the amount of redundancy is controlled using a user-defined redundancy threshold.

The novel architectures and the new reliability-centric synthesis algorithms proposed for the various design abstraction levels have been shown to achieve significant reduction of soft error rates in current nanometer circuits. The design techniques, algorithms and architectures can be integrated into existing design flows. A VLSI system implementation can leverage the architectural solutions for the reliability of the caches, while the custom hardware synthesized for the VLSI system can be protected against radiation strikes by utilizing the circuit level, logic level and layout level optimization algorithms that have been developed.
CHAPTER 1
INTRODUCTION

The race to innovate has led to unprecedented progress in the field of electronic computing. This has contributed to the ubiquitous use of VLSI systems in personal computers and large-scale servers, portable and mobile electronic systems like laptop computers, cellular phones and music players, and various embedded computing systems deployed in televisions, cars and almost every consumer electronic system. As this revolution continues, the cost of computing decreases even further, and applications which were economically infeasible slowly become practical. The high rate of growth in VLSI technology is sustained by scaling the minimum feature sizes of transistors to smaller and smaller dimensions, along with the continuous reduction in the operating supply and threshold voltages. This trend in technology scaling has helped the design of modern VLSI systems for higher performance and lower power consumption.

Higher integration densities, increases in operating frequencies and reductions of supply voltages, however, make the reliability of these systems a key concern. High electric fields in scaled devices give rise to effects like hot carrier injection (HCI) and negative bias temperature instability (NBTI), which often manifest themselves as an increase in the threshold voltage and can lead to device slowdown and eventually to timing failure of the circuit. The strong electric fields in wires often cause momentum transfer during collisions between conducting electrons and metal atoms (electromigration), which can lead to shorts or opens in interconnects. However, HCI, NBTI, electromigration and other effects due to high electric fields develop over time, causing permanent failures, and hence affect only the availability and lifetime of the designs. Transient faults, on the other hand, create intermittent faults in VLSI systems.
These can occur due to several reasons like soft errors, power supply and interconnect noise, and electromagnetic interference. Soft errors occur when energetic neutrons coming from space or alpha particles arising out of packaging materials hit the transistors. When such high energy ions strike the diffusion region of VLSI circuits, a voltage spike may be generated on the affected circuit node. A voltage spike of high magnitude can result in a soft error in the circuit. A soft error may manifest itself as a bit flip in a latch or memory element. Additionally, soft errors can occur at any internal node of combinational logic and subsequently propagate to and be captured in a latch. Radiation-induced soft errors are one of the biggest contributors to transient faults and present the biggest challenge to the reliability of VLSI systems implemented with current nanometer technologies.

Figure 1.1 Impact of SER due to Gate Length Scaling in nm (Source Intel) [8]

1.1 Motivation

In the past, soft errors have been a significant problem only in space applications. However, the recent trends in technology scaling have greatly decreased the radiation immunity of electronic circuits and have made nanometer designs highly susceptible to transient faults. Figure 1.1 illustrates that the soft error rate of several VLSI processor systems designed at Intel has grown exponentially due to technology scaling trends. Moreover, while space applications could use advanced fabrication technology and packaging material to reduce soft errors, such techniques are typically too expensive to implement in low cost commercial systems.

VLSI systems are increasingly being designed as System-on-Chip implementations. System-on-Chips are being designed for a wide range of applications, from general purpose computing to special purpose VLSI systems.
General purpose computing implementations on System-on-Chip require large caches and off-the-shelf processors to enhance the performance of the applications. On the other hand, special purpose applications implemented on System-on-Chips have a large amount of special purpose hardware which is synthesized from system level RTL code. The chip manufacturers typically set budgets on soft error rates for such systems, which should be met by the resulting design with low overheads in performance and power. An effective approach for mitigation of soft error effects for such a system is to implement steps for reliability against soft errors in the design flow itself. However, such a unified approach for soft error mitigation at various design abstraction levels has not been tried in prior research. Addressing soft error issues in such a unified design flow gives the system designer the opportunity to weigh the implications of dedicating more resources to soft error detection and prevention against the corresponding impact on delay, power and area.

Figure 1.2 Design Flow for Soft Error Tolerance in VLSI Systems

Memory structures have been considered as dominant sources of transient errors in VLSI systems [1], [2], [3], [4], [47]. These include on-chip caches, DRAMs, register files, and other on-chip memory structures in VLSI systems.
Although the soft error rate in SRAMs has remained constant for a given cache size with technology scaling, the rate of multibit errors has increased significantly across technology generations as device feature sizes shrink. Moreover, multibit errors tend to develop over time in large caches, which are typical in current VLSI systems with high memory integration density. Architectural strategies for prevention of soft errors in such large caches have not been explored in prior research.

Further, technology trends like smaller feature sizes, lower voltage levels, higher operating frequency and reduced logic depth are projected to increase the soft error rate (SER) in combinational logic as well [28], [8]. In a recent study [32], the SER of logic circuits was quantified in technology nodes from 600nm to 50nm, and it was projected that by 2011 the SER in logic circuits would increase by nine orders of magnitude and would essentially be comparable to that of unprotected memory. Thus, there is an imminent need for novel algorithms for synthesizing soft error tolerant combinational logic circuits in a design flow. The current work fills a major gap in this direction.
Thus, in summary, the motivation for this dissertation is to explore the core issues in the problem of soft errors and develop a design flow framework that optimizes the soft error rate at various design abstraction levels and, in particular, encompasses:

- Models for efficient estimation and optimization of soft errors at all the design abstraction levels.
- Layout level optimization schemes geared towards mitigation of soft errors during automated physical design.
- Novel circuit level techniques for mitigation of soft errors with low overheads in delay, area and power.
- Low overhead techniques for mitigation of soft errors at the logic level, especially under the influence of other uncertain noise effects.
- Efficient architectural solutions that handle multibit errors in hardware storage structures found in current VLSI systems.

A design flow framework incorporating the architectural solutions for the reliability of the storage structures and novel soft-error-aware synthesis algorithms at various design abstraction levels can be used to implement VLSI systems which are completely immune to radiation induced reliability issues.

1.2 Contributions and Significance

In this dissertation, we investigate the development of a unified design flow framework for mitigation of soft errors. Several design and circuit optimization techniques applicable at various levels of hardware design are explored to improve the reliability of computing systems. We have experimented extensively in developing novel techniques at each level of the design flow to reduce the impact of soft errors in VLSI systems. Figure 1.2 illustrates a design flow framework in VLSI System-on-Chips.
As shown in the figure, the techniques developed as part of this dissertation can be integrated into such a unified design flow framework to significantly reduce the SER of a VLSI system with low overheads in delay, area and power. The theme and the major research works pertaining to this dissertation are summarized in Figure 1.3.

Figure 1.3 List of Contributions

The key contributions of this dissertation can be described as follows:

- We propose efficient metrics for estimation of soft errors at various design abstraction levels. Soft errors of storage structures like caches are estimated using the vulnerability metric, while soft errors in logic circuits are estimated by using two new metrics called the glitch enabling probability (GEP) and the cumulative probability of observability (CPO), defined based on the signal probabilities of the nets. These metrics accurately model soft errors in radiation-aware synthesis algorithms and help in efficient exploration of the design solution space.
- We develop new algorithms for radiation-aware automatic physical design by intelligently modifying the placement stage in cell based designs. Larger netlengths can provide larger RC ladders to effectively filter out the transient glitches. Towards this, a new heuristic has been developed to selectively assign larger wirelengths to certain critical nets to increase the radiation immunity of circuits with low delay and area overhead. Based on this, we propose two placement algorithms based on simulated annealing and quadratic programming that significantly reduce the SER in standard cell based designs of logic circuits.
- We propose a transistor level circuit optimization technique based on a radiation blocker circuit which significantly reduces the propagation of random glitches due to radiation strikes. The radiation blocker circuit can fight transient glitches on standard cell outputs due to random radiation strikes by using an RC differentiator circuit. The circuit is used to isolate the driven cell from the driving cell which is being hit by a radiation strike. An algorithm based on ranking circuit nodes using a new metric called the probability of radiation blocker circuit insertion (PRI) has been developed. The radiation blocker cells are inserted only on the top few nodes in the sorted list of PRI values.
- We develop a gate sizing algorithm at the logic level of design abstraction that can jointly optimize the circuit against both radiation induced soft errors and the compounded noise effects of capacitive crosstalk. Based on a first order model of the size dependence of logic gates for soft errors and efficient modeling of crosstalk noise and process variations, a reliability-centric gate sizing algorithm under process variation has been formulated. This multimetric gate sizing framework has been formulated as a nonlinear mathematical program and is efficiently solved.
- We propose architectural level techniques for correction of multibit errors in the L2 cache by exploiting the redundancy existing between the write-through L1 cache and the L2 cache, and the redundancy existing between the clean data lines of the L2 cache and the main memory. We also develop methods to increase the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement policy in which the amount of redundancy is controlled based on a redundancy threshold. We also investigate techniques to mine redundancy at the word level by duplicating small memory values in the upper half of the memory word.

Thus, we have explored both novel architectures and novel reliability-centric synthesis algorithms for reducing the vulnerability to soft errors and achieved significant reduction of soft error rates in VLSI systems. The design techniques, algorithms and architectures developed can be integrated into existing design flows. A VLSI system implementation can leverage the architectural solutions for the reliability of the caches, while the custom hardware synthesized for the VLSI system can be protected against radiation strikes by utilizing the circuit level, logic level and layout level optimization algorithms that have been developed.

1.3 Outline of Dissertation

The remainder of this dissertation is organized as follows: Chapter 2 describes the related work pertaining to our research. In Chapter 3, we develop novel metrics for modeling of soft errors in VLSI circuits. These metrics are used extensively throughout the dissertation. In Chapter 4, we show that higher wirelengths for nets can act as a larger RC ladder and can effectively filter out transient glitches due to radiation strikes.
Based on this, we present two placement algorithms that place standard cells so as to provide higher wirelengths for soft error critical nets while simultaneously constraining the chip area and the total wirelength. We show that such placement algorithms can significantly reduce the SER of logic circuits. In Chapter 5, we propose a circuit level technique based on an RC differentiator circuit which can be inserted at the output of a logic cell to prevent the generation of transient glitches due to radiation strikes. The radiation blocker circuit has the configuration of an RC differentiator and is used to cut off the driven cell from the driving cell which is hit by a radiation strike. Further in that chapter, we propose an algorithm to insert radiation blocker cells only on selected nodes in a logic circuit. The algorithm ranks the circuit nodes by a new metric called the probability of radiation blocker circuit insertion (PRI) and inserts these cells only on the top few nodes in the sorted list of PRI values. In Chapter 6, we develop a first order model of the soft error phenomenon in logic circuits and incorporate power and delay metrics to formulate a convex programming based reliability-centric gate sizing technique. In Chapter 7, we investigate in detail the multibit soft error rates in large L2 caches and propose a framework of solutions for their correction based on the amount of redundancy present in the memory hierarchy. We investigate several new techniques for reducing multibit errors in large L2 caches, in which the multibit errors are detected using simple error detection codes and corrected using the data redundancy in the memory hierarchy. We also propose several techniques to control or mine the redundancy in the memory hierarchy to further improve the reliability of the L2 cache.
The concluding remarks and the suggested future work, in terms of extensions to the problems addressed in this dissertation and other ideas for further refinements, are given in Chapter 8.

CHAPTER 2 RELATED WORK

The trends in technology scaling have helped the design of modern microprocessors for higher performance and lower power consumption through the rapid shrinking of the minimum feature size as well as the reduction of supply voltages [8]. At the same time, microprocessors are being built with a higher degree of spatial parallelism and deeper pipelines to increase the clock frequency [6]. Unfortunately, however, these trends make them more susceptible to transient faults [7-11]. Several different strategies have been investigated in the past to avoid, detect and recover from soft errors [12]. These solutions are applied at various levels of the system, from process technology and circuit to microarchitecture levels. In this chapter, we first review the various previous works found in the literature for soft error mitigation at various design abstraction levels and then present the context of this dissertation in the light of these previous works.

2.1 Previous Work

Memory structures have been considered as dominant sources of transient errors in computer systems [1-4, 47]. These include on-chip caches, DRAMs, register files, and other on-chip memory structures. As shown in the taxonomy diagram given in Figure 2.1, the L2 caches have been traditionally protected against soft errors using error correction codes (ECC) [1, 2, 4]. The tasks of detecting and correcting soft errors using ECC, however, incur a large penalty in area. For example, double error correction and double error detection (DECDED) codes require 14 bits for each 64-bit memory word, corresponding to a 22% area overhead.
Multibit error protection using sophisticated ECC will also require more bit lines and wider sense amplifiers, thus increasing the cache access latency and power consumption. Spatial multibit errors can also be avoided by using layout level techniques like physical interleaving [15]. However, with higher interleaving factors, multiple word lines need to be driven and data need to be regrouped or routed for read/write operations, thus increasing the cache access latency. Multibit errors can be avoided by correcting single-bit errors during scrubbing, before they develop into temporal multibit errors by another particle strike. However, choosing the right scrub interval is often difficult [16]. Most importantly, scrubbing cannot eliminate spatial multibit errors, since spatial multibit errors occur due to a single particle strike rather than evolving over time.

Figure 2.1 Taxonomy Diagram: Works Related to Transient Faults on Caches.

Several schemes have been proposed in the literature to reduce the area overhead associated with protecting memory by ECC [17]. In [18], error protection is suggested for frequently accessed cache lines.
In [19], the authors described the use of a dead block prediction technique to hold the copy of data found in active cache blocks. A larger ECC word can also be used to compensate for the area overhead [20]. However, since the unit of memory read/write is based on word granularity, each memory read/write requires reading several data words to generate SECDED check bits. In [21], a small fully associative replication cache is used to maintain replicas of writes, which are used to detect and correct errors. In [22], the authors mention using redundancy for area efficient error protection. However, detailed results in the context of multibit errors have not been provided. Recently, in [23-25], several techniques have been proposed for area efficient multibit error correction. In [23], the authors proposed to reduce the multibit soft errors of L1 caches using last store prediction. In [24], the authors proposed the use of two-dimensional error codes which can correct clustered 32x32 errors with significantly smaller overheads in area, performance and power.

The soft errors that do not affect the program output are considered benign, as no error is observed by the user. This situation can occur, for example, in branch prediction logic or in the instructions from misspeculated execution sequences which never commit and thus will never lead to visible error states. Soft errors which affect the program output are typically defined in terms of failures in time (FIT) [13, 14]. The chip manufacturers typically set budgets on soft error rates which should be met by the design.

Single event transients can also occur in any internal node of combinational logic and subsequently propagate to and be captured in a latch.
Although soft errors have been a greater concern for memory elements, technology trends like smaller feature sizes, lower voltage levels, higher operating frequency and reduced logic depth are projected to increase the soft error rate (SER) in combinational logic beyond that of unprotected memory elements [8, 28]. A taxonomy diagram illustrating the various approaches for protecting logic circuits against soft errors is shown in Figure 2.2. As shown, soft errors can be prevented in logic circuits by using various circuit level optimization techniques. In [35], time redundancy is exploited to detect and recover from soft errors. In [47], a technique for correction of logic soft errors using C-elements has been proposed. In [85], concurrent error detection circuits are added to nodes in logic circuits which have high soft error susceptibility. In [36], soft error protection in domino logic and sequential cells is achieved by explicitly adding capacitors to the feedback node. In [40], the idea has been extended to combinational logic circuits. However, as the stored charge in the keeper becomes smaller due to technology scaling, the technique becomes inefficient in fighting transient glitches due to radiation strikes. In [39], gates are locally duplicated and the duplicate nodes are connected by a voltage clamper circuit. This prevents the output node of the gate and its duplicate node from deviating in voltage due to a radiation strike. This technique, however, doubles the area and power overhead. The effect is especially severe for complex cells or cells with higher drive strengths. Adding shadow gates for such cells with large silicon footprint makes the area and power overhead significant. In [37], the logic gates that are bombarded by radiation strikes are isolated using complementary pass gates. The complementary pass gates act as a low pass filter and filter out transient voltage pulses due to a radiation strike. In [41], a class of soft error masking circuits is proposed using a Schmitt trigger circuit. These techniques, however, can achieve a marginal reduction in the radiation induced glitch magnitude but cannot completely eliminate the transient.

Figure 2.2 Soft Error Protection Using Circuit Level Optimizations

Figure 2.3 Taxonomy Diagram of Works in Gate Sizing Based on Optimization Metrics

Placement is the process of arranging the circuit components on a layout surface to minimize a certain cost metric. This cost metric may be the overall chip area, which is simply the area of the bounding box enclosing the circuit components, or the total wirelength. Good total wirelength costs not only predict routability and routing area but also provide an easy to compute rough estimate of the circuit delay. In general, the placement problem is NP-complete even for the simplest case of 1D placement [49]. Therefore, several heuristic algorithms have been proposed that can give good solutions in a reasonable amount of time. An efficient and unique representation of a valid placement configuration is through the use of sequence pairs [53]. Placement algorithms have been used for improving power, delay [49], crosstalk [57], routability [51], parametric yield [50], etc. However, we show that placement can be used for optimizing netlength distribution and can be used as an effective tool for reducing circuit SER. The reduction of SER can be achieved by selectively optimizing netlengths for soft error critical nets.
Towards this, we have developed new placement algorithms for radiation immunity of logic circuits using a standard cell based design flow. To the best of our knowledge, SER reduction using optimizations at the placement stage is attempted for the first time in this dissertation.

Gate sizing is a simple yet effective technique for optimizing circuits for performance, power and reliability. Figure 2.3 summarizes the various metrics used in circuit optimization during gate sizing. Traditionally, the gate sizing problem has been formulated as an unconstrained delay minimization problem or as an area and power minimization problem under delay constraints [81]. Many gate sizing formulations have attempted to improve power, area or noise under an acceptable parametric yield. In [83], the optimization uses a penalty function to improve the slacks of critical paths and thereby improve yield. A stochastic programming approach with chance constraints is used in [76] to incorporate yield in the gate sizing problem formulation. Recently, in [72], the joint optimization of power and delay under process variations has been attempted. The continuous shrinking of noise margins due to feature size scaling has made nanometer circuits increasingly vulnerable to reliability issues like soft errors and crosstalk noise [103]. In [84], the asymmetric logical masking probability of nodes in a logic circuit is exploited to selectively resize gates. In the pioneering works [100, 104], the authors proposed the use of probabilistic computation using Markov Random Fields to provide immunity against soft faults. Flip-flop selection is used to reduce the impact of soft errors in [30]. In [97], an iterative gate sizing algorithm has been proposed to perform coupling noise reduction. In [99], the authors propose a two pass method to resize the gates such that the noise constraints are satisfied without violating the timing constraints.
It is pointed out in [56] that the above techniques involve high design overhead and lack scalability. More importantly, despite extensive research on single noise sources, few works have focused on the development of analysis and joint optimization techniques for crosstalk noise and soft error effects. However, as we discuss later, a deeper interrelationship exists between radiation induced noise and capacitively coupled noise. All the above techniques, in general, apply to a single noise source and cannot address the evolving reality of multiple interacting noise sources under process variations.

The impact of parameter variations on performance, power and reliability has been increasing with each technology generation. The variations are caused either by environmental effects like changes in power supply voltage and temperature or by physical effects like changes in transistor width, channel length, oxide thickness and interconnect dimensions. The uncertainty in the process parameters due to the imprecision of the fabrication process deeply impacts the timing, power and noise characteristics of circuits. Thus, identically designed circuits can have huge differences in these characteristics, leading to loss in the parametric timing, power and noise yield of these circuits. In this dissertation, we investigate the challenging problem of addressing the evolving reality of multiple interacting noise sources under process variations using gate sizing.

2.2 Dissertation Context in Light of Past Works

As the SER in current nanometer circuits is increasing exponentially, there is an imminent need for novel algorithms and architectures for synthesizing soft error tolerant circuits in a design flow. The correlating impact of soft error mitigation along with the associated overhead has not been considered in a unified manner in previous works.
This is especially true for current nanometer chips, since VLSI systems are increasingly being designed as System-on-Chip implementations. An effective approach for mitigation of soft error effects for such a system is to implement steps for reliability against soft errors in the design flow itself. However, such a unified approach for soft error mitigation at various design abstraction levels has never been tried before in any prior research. Addressing soft error issues in such a unified design flow gives the system designer the opportunity to weigh the implications of dedicating more resources to soft error detection and prevention against the correlating impact on delay, power and area. Moreover, the study of multibit errors in large caches in these systems and strategies for prevention of such multibit soft errors have not been explored in prior research.

In this dissertation, we explore techniques at all levels in the design flow to reduce the vulnerability of VLSI systems against soft errors without compromising on other design metrics like delay, area and power. We propose new metrics to model the glitch masking in a circuit using the signal probabilities of the nets. We leverage the use of larger netlengths as we propose two placement algorithms based on simulated annealing and quadratic programming that significantly reduce the soft error rates of circuits. To the best of our knowledge, this is the first work for SER reduction using optimizations at the placement stage. Further, we explore approaches for hardening of selective nodes within a circuit which significantly reduce the probability of generation of random radiation induced glitches. The technique achieves a greater reduction in SER, at very low overheads, than any of the works listed above.
We developed a reliability-centric gate sizing algorithm considering compound noise effects under process variation using a multimetric optimization framework. Simultaneous optimization of soft errors along with other design metrics under process variations is a challenging problem and has not been attempted before. We also explore efficient architectural solutions that handle multibit errors in hardware storage structures. Unlike the previous works, our architectural techniques target multibit errors in large caches and achieve high SER reduction at minimum area and power overheads and with virtually no performance penalty. The design techniques, algorithms and architectures can be integrated into existing design flows for VLSI systems using system-on-chips. The current work is thus unique and fills the void in designing reliable and soft error tolerant VLSI systems implemented as systems-on-chip in a unified manner.

CHAPTER 3 ESTIMATION MODELS

In this chapter, we describe the preliminaries on soft errors and develop models for estimation of soft errors in circuits. We also describe the estimation of timing slack on circuit nodes, which we use extensively throughout this dissertation.

3.1 Background

The occurrences of random radiation induced energetic neutron strikes are generally distributed fairly uniformly in space and time. The probability of a particle strike in a circuit node is thus roughly proportional to its active area. The device level raw SER can be expressed by the following empirical model [27],

$$SER_{device} = F \cdot A_d \cdot K \cdot e^{-Q_{crit}/Q_s} \tag{3.1}$$

where $F$ is the total neutron flux within the whole energy spectrum, $A_d$ is the diffusion area that is sensitive to particle strikes, $K$ is a technology dependent fitting parameter, $Q_{crit}$ is the critical charge, and $Q_s$ is the charge collection efficiency of the device.
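The exponential dependence on the critical charge in Equation 3.1 can be illustrated numerically. The following is a minimal sketch, not part of the original formulation: the function name `raw_device_ser` and all parameter values are hypothetical and unitless, chosen only to show how a reduction in $Q_{crit}$ changes the raw SER.

```python
import math

def raw_device_ser(flux, diffusion_area, k_fit, q_crit, q_s):
    """Raw device-level SER per Equation 3.1: F * A_d * K * exp(-Q_crit / Q_s)."""
    return flux * diffusion_area * k_fit * math.exp(-q_crit / q_s)

# Hypothetical, unitless parameter values chosen only to show the trend.
nominal = raw_device_ser(flux=1.0, diffusion_area=1.0, k_fit=1.0, q_crit=10.0, q_s=5.0)
scaled = raw_device_ser(flux=1.0, diffusion_area=1.0, k_fit=1.0, q_crit=5.0, q_s=5.0)

# Reducing Q_crit from 2*Q_s to Q_s raises the raw SER by a factor of e.
ratio = scaled / nominal
print(f"SER increase factor: {ratio:.3f}")
```

With all other parameters fixed, the model is linear in the sensitive area and exponential in the ratio $Q_{crit}/Q_s$, which is why a shrinking critical charge dominates the trend.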
The threshold critical charge, $Q_{crit}$, marks the onset of the double exponential current pulse behavior described below. As the technology scales, the charge $Q_{crit}$ required to create a soft error upset decreases considerably. For memory elements, this glitch voltage is fed back, creating a metastable condition, and finally results in a bit flip in the stored information in the memory element. Memories have been the traditional victims of radiation induced soft errors. This is due to the dense layouts of memory cells comprising small transistors, leading to lower capacitances and very little charge being held to represent a state. Although the SER in SRAMs has remained constant for a given cache size with technology scaling, the rate of multibit errors has increased significantly with shrinking device geometries. Spatial multibit errors occur when a single particle strike upsets multiple adjacent cells. A higher packing of the cells in the same active area can now cause a single radiation strike to affect multiple cells simultaneously, potentially leading to multibit errors. Temporal multibit errors can also occur in large caches when multiple independent particles affect bits in the same word at different times, primarily due to the larger mean lifetime of cache lines in bigger caches.

For combinational logic circuits, whether a particle strike on a circuit node manifests as a soft error depends on the circuit topology. Though the characteristics of a transient pulse at a node depend on the energy distribution of the incident particle, the drive strength of the gate, and the critical charge, various masking factors determine whether the transient pulse can actually propagate to the primary outputs, latches or flip-flops and result in a soft error [84].
The charge deposition at a particular circuit node is traditionally modeled in simulations by a double exponential current pulse $I_{in}(t)$ [27], which can be represented as,

$$I_{in}(t) = \frac{Q}{\tau_a - \tau_b}\left(e^{-t/\tau_a} - e^{-t/\tau_b}\right) \tag{3.2}$$

where $Q$ is the charge deposited as a result of a particle strike, $\tau_a$ is the collection time constant of the junction, and $\tau_b$ is the ion-track establishment time constant. $\tau_a$ and $\tau_b$ are generally defined by process parameters. Next, we describe the three primary factors that can potentially mask radiation induced transient current pulses.

Logical masking occurs when there is no sensitized path from the gate node where the transient pulse occurs to any of the primary outputs. The transient pulse is filtered when it arrives at an input of a gate for which any of the other inputs is at a controlling logic value.

Electrical masking occurs due to electrical attenuation of the transient pulse in a sensitized path, from its occurrence at a particular gate node to any of the primary outputs. Thus, the extent of electrical masking depends on the electrical properties of the gates in the sensitized path.

Timing-window masking occurs when the transient pulse does arrive at the primary outputs with sufficient strength to cause a soft error but is sufficiently separated in time from the arrival of the clock edge. As the latch only samples its input on the clock edge, and as the transient pulse is momentary, it does not effectively lead to a soft error.

Figure 3.1 Masking in Logic Circuits

We illustrate these masking effects using the example circuit shown in Figure 3.1. As shown in the figure, a transient pulse is generated on net B due to a radiation strike on the active area of gate G2. The transient pulse is logically masked at the output of gate G5 as its other input is at its controlling value (0).
However, the transient pulse is sensitized through gate G4 and suffers some electrical attenuation. The pulse is further attenuated through gate G6. If the transient pulse at the primary output is of sufficient strength, it may lead to a soft error provided it is within a timing window of the clock edge, i.e., the pulse must arrive some time before the clock edge (setup time constraint) and stay until some time after the clock edge (hold time constraint).

These masking effects thus give the internal circuit nodes varying levels of susceptibility to soft errors [84]. Thus, the SER of the overall circuit is often quite different from the accumulated device level SER as given by Equation 3.1. The system level SER at the architectural level can be calculated as,

$$SER_{system} = SER_{ckt} \times Vulnerability \tag{3.3}$$

Vulnerability depends on the target applications of the system and the architectural choices used to implement the system. The vulnerability of cache structures is studied in detail in Chapter 7. Soft error rates also depend on environmental factors like altitude; however, we do not model this in our formulation.

3.2 Metrics to Estimate Soft Error Masking Effects

The observability metric is inversely proportional to the masking effect of each circuit node. Thus, the nodes with high observability have lower soft error masking ability than the nodes with low observability and vice versa.

3.2.1 Estimating Logical Masking Effects

The glitch enabling probability (GEP) of each net connected to a gate input is defined as the probability that a glitch at the gate input will propagate to the gate output. The GEP of a gate input is computed as the product of the probabilities that all other inputs of the gate are at the gate's enabling value.
Thus, mathematically, the GEP of input $i$ of gate $j$ can be computed as,

$$GEP_{ij} = \prod_{k \in inputs(j),\; k \neq i} P_{enab}(k) \tag{3.4}$$

where $inputs(j)$ is the set of all inputs to gate $j$ and $P_{enab}(k)$ is the probability that input $k$ is at its enabling value. The enabling value for a gate input depends on the type of gate. For example, for an AND function the enabling value is logic 1 and the enabling probability is given by,

$$P_{enab}(k) = P_s(k) \tag{3.5}$$

where $P_s(k)$ is the signal probability of input $k$, i.e., the probability that the input $k$ is at logic 1. For the OR function the enabling value is logic 0 and the enabling probability is given by

$$P_{enab}(k) = 1 - P_s(k) \tag{3.6}$$

Given the signal probabilities of the primary inputs, the signal probabilities of the internal circuit nodes can be calculated [44]. Depending on the function of a particular input and the type of gate itself, the GEP values of each gate input can then be calculated. Thus, the signal probabilities and the GEP values of all internal nodes can be calculated using a forward pass through the circuit by visiting circuit nodes in topologically sorted order from the primary inputs to the primary outputs. It should be noted that in the above formulation we have implicitly ignored the correlation in signal probabilities of the nets [44]. Calculating GEP values while considering signal correlations makes the estimation metrics considerably more compute intensive. Thus, in order to reduce the computational complexity of our optimization schemes using these metrics, we have assumed independence of signal probabilities of nets at the cost of slight inaccuracy [59]. More accurate computation of GEP values by considering correlation in signal probabilities can always be added to our proposed metrics without significant modification of our formulation.
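The forward pass and Equations 3.4 to 3.6 can be sketched as follows. This is an illustrative sketch under the independence assumption stated above, not the dissertation's implementation. The netlist encoding, the function names, and the small two-gate fragment are hypothetical. Treating NAND like AND and NOR like OR for the enabling value is an assumption made here for completeness.

```python
from graphlib import TopologicalSorter
from math import prod

# Hypothetical netlist: gate name -> (gate type, input nets).
# A gate's output net carries the gate's name.
NETLIST = {
    "G2": ("AND", ["I3", "I4"]),
    "G3": ("NAND", ["G2", "I5"]),
}
PRIMARY_INPUT_PROB = {"I3": 0.5, "I4": 0.5, "I5": 0.5}

def signal_probabilities(netlist, pi_prob):
    """Forward pass: probability that each net is at logic 1 (inputs independent)."""
    deps = {gate: set(ins) for gate, (_, ins) in netlist.items()}
    p = dict(pi_prob)
    for net in TopologicalSorter(deps).static_order():
        if net in p:  # primary input, probability already given
            continue
        kind, ins = netlist[net]
        in_p = [p[i] for i in ins]
        if kind == "AND":
            p[net] = prod(in_p)
        elif kind == "NAND":
            p[net] = 1.0 - prod(in_p)
        elif kind == "OR":
            p[net] = 1.0 - prod(1.0 - x for x in in_p)
        elif kind == "NOR":
            p[net] = prod(1.0 - x for x in in_p)
        else:
            raise ValueError(f"unsupported gate type: {kind}")
    return p

def enabling_probability(kind, p_signal):
    """Equations 3.5 and 3.6; NAND and NOR handling is an assumption."""
    if kind in ("AND", "NAND"):
        return p_signal        # enabling value is logic 1
    if kind in ("OR", "NOR"):
        return 1.0 - p_signal  # enabling value is logic 0
    raise ValueError(f"unsupported gate type: {kind}")

def glitch_enabling_probabilities(netlist, p):
    """Equation 3.4: product of enabling probabilities of all other inputs."""
    gep = {}
    for gate, (kind, ins) in netlist.items():
        for idx, net in enumerate(ins):
            others = [k for n, k in enumerate(ins) if n != idx]
            gep[(gate, net)] = prod(enabling_probability(kind, p[k]) for k in others)
    return gep

probs = signal_probabilities(NETLIST, PRIMARY_INPUT_PROB)
gep = glitch_enabling_probabilities(NETLIST, probs)
```

In this fragment the side input of G3 seen from net G2 is I5, so the GEP of that input equals the signal probability of I5. A single-input gate gets a GEP of 1 because the product over an empty set of other inputs is 1.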
The logical observability of each net is defined as the probability that a glitch on that net will propagate to any primary output of the circuit. The computation of the logical observability of a net is based on the GEP values for gate inputs. The logical observability of a net is 1 for primary output nets. Given the logical observability of the output net of a gate, the logical observability at an input net $i$ of the gate $j$ is given as,

$$LogicalObserv(i) = GEP_{ij} \times LogicalObserv(j) \tag{3.7}$$

Thus, the logical observability of each input of a gate is calculated recursively by multiplying the logical observability of its output net with the GEP of the corresponding input net. The logical observability of the stem of a fanout node is computed by considering the maximum logical observability of all its branches. Thus, the logical observability of a net is computed using a backward pass of the structural netlist in topologically sorted order from the primary outputs towards the primary inputs. The logical observability thus obtained is finally normalized by dividing it by the maximum logical observability of all nets in the circuit.

Figure 3.2 Illustrating Computation of Logic Observability: (A) Signal Probabilities of Nets, (B) GEP Values at Internal Gate Inputs, (C) Logical Observability Values at Various Gate Outputs

In Figure 3.2, we illustrate the computation of the logical observability for an example circuit. The signal probabilities of internal nets are computed using the signal probabilities of the primary inputs.
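The backward pass built on Equation 3.7 can be sketched as below. This is a minimal illustration, assuming the GEP values are already available. The circuit, the GEP numbers, and the function name are hypothetical and do not correspond to the circuit of Figure 3.2.

```python
from graphlib import TopologicalSorter

# Hypothetical circuit: G1 is a fanout stem that feeds both G4 and G5.
NETLIST = {
    "G1": ("NAND", ["I1", "I2"]),
    "G4": ("AND", ["G1", "I3"]),
    "G5": ("AND", ["G1", "I4"]),
}
PRIMARY_OUTPUTS = ["G4", "G5"]

# Hypothetical GEP values, keyed by (gate, input net).
GEP = {
    ("G1", "I1"): 0.5, ("G1", "I2"): 0.5,
    ("G4", "G1"): 0.5, ("G4", "I3"): 0.75,
    ("G5", "G1"): 0.25, ("G5", "I4"): 0.75,
}

def logical_observability(netlist, gep, primary_outputs):
    """Backward pass of Equation 3.7; a stem takes the maximum over its branches."""
    deps = {gate: set(ins) for gate, (_, ins) in netlist.items()}
    order = list(TopologicalSorter(deps).static_order())
    obs = {net: 1.0 for net in primary_outputs}
    # Reverse topological order: every fanout gate is visited before its driver,
    # so the observability of a gate output is final when the gate is processed.
    for gate in reversed(order):
        if gate not in netlist:  # primary input, nothing to propagate through
            continue
        out_obs = obs.get(gate, 0.0)  # 0 if the output reaches no primary output
        for net in netlist[gate][1]:
            branch = gep[(gate, net)] * out_obs
            obs[net] = max(obs.get(net, 0.0), branch)
    peak = max(obs.values())
    return {net: value / peak for net, value in obs.items()}

observ = logical_observability(NETLIST, GEP, PRIMARY_OUTPUTS)
```

Here the two branches of stem G1 have observabilities 0.5 and 0.25, so the stem takes the larger value, and the inputs of G1 inherit that value scaled by their own GEP.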
For example, the signal probability at the output of G2 is the product of the signal probabilities of its inputs I3 and I4, which is 0.25. As G3 is a NAND gate, the product of the signal probabilities of its inputs gives the probability of the output being at logic 0. Since the probability of the output of gate G3 being at logic 0 is 0.125, the signal probability of that net is 0.875. Thus, a forward pass through the structural netlist in topologically sorted order provides the signal probabilities of all nets, as shown in Figure 3.2(A). Since the circuit consists of just NAND and AND gates, the enabling value of all gate inputs is logic 1. So for this circuit, the probability of an input being at its enabling value is the same as its signal probability. The GEP of each gate input can thus be computed using Equation 3.4 and the previously computed signal probabilities. The computed GEP values of all gates with internal nets as inputs are shown in Figure 3.2(B). The logical observability values for gates with internal nets as their outputs are then computed using a backward pass of the structural netlist, using Equation 3.7. As previously discussed, the logical observability for a stem is computed by taking the maximum of the logical observability of the branches. Thus, the logical observability of the output of gate G1 is computed by taking the maximum of 0.508 and 0.117, which is 0.508. The computed logical observability values of all gates with outputs as internal nets are shown in Figure 3.2(C).

3.2.2 Estimating Electrical Masking Effects

The strength of electrical masking for a particular gate can be estimated by creating noise rejection curves (NRC) for that gate type. The NRC for an inverter cell with gate length of 180 nm, reported in [59], is shown in Figure 3.3. The x-axis denotes the input noise width and the y-axis denotes the input noise height. All radiation induced SETs which are below the NRC are noise-immune.
In other words, either they have a width below the corresponding NRC or have a height to the left of the NRC. Radiation-induced SETs which are above the NRC are noise-sensitive. Thus, the area under the NRC divided by the area over the NRC corresponds to the electrical masking of a gate. It should be noted that, for a particular noise pulse with given width and height, the electrical masking is higher for a gate with higher fanout load. We estimate the electrical observability (which has an inverse relation to electrical masking) of a gate i at its output net as follows,

$ElectricalObserv_i = \frac{W_i}{C_{L_i}}$ (3.8)

where $W_i$ is the size of gate i and $C_{L_i}$ is the capacitive load at node i. In general, the NRC can be expressed analytically as well [60], and the inverse relationship to fanout load can be shown mathematically under some simplifying assumptions. The maximum electrical observability in the circuit is used to normalize the node electrical observability.

Figure 3.3 NRC for an Inverter at Varying Capacitive Loads [59]

3.2.3 Estimating Timing Window Masking Effects

We determine a pessimistic estimate of a timing window (TW) such that noise existing in that timing window will reach the primary outputs and get latched in the output flops. We estimate the TW observability at each node by computing the difference between the maximum and the minimum delay from that node to any primary output. Thus, the TW observability of a gate i at its output net can be expressed mathematically as follows,

$TwObserv_i = \max_{j \in PO} PathToPO_{ij} - \min_{j \in PO} PathToPO_{ij}$ (3.9)

where $PathToPO_{ij}$ is the path delay from the node i to primary output j and PO is the set of primary outputs of the circuit.
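Equation 3.9 can be evaluated with a single traversal from the primary outputs back to the primary inputs. The following is a minimal sketch under assumed conventions: the netlist is a topologically ordered list of gates, `delay` maps each gate output net to the delay of its driving gate, and the delays in the example are invented for illustration.

```python
def tw_observability(gates, primary_outputs, delay):
    """Timing-window observability of equation 3.9 (illustrative sketch).

    gates: list of (output_net, [input_nets]) in topological order.
    delay: {output_net: delay of the gate driving that net}.
    """
    # Max and min path delay to any primary output start at 0 on the outputs.
    max_d = {po: 0.0 for po in primary_outputs}
    min_d = {po: 0.0 for po in primary_outputs}
    for out_net, in_nets in reversed(gates):
        if out_net not in max_d:
            continue  # net has no path to a primary output
        # The gate delay is added when moving from gate output to gate input.
        hi = max_d[out_net] + delay[out_net]
        lo = min_d[out_net] + delay[out_net]
        for net in in_nets:
            # At a stem, keep the max of the maxima and the min of the minima.
            max_d[net] = max(max_d.get(net, hi), hi)
            min_d[net] = min(min_d.get(net, lo), lo)
    tw = {net: max_d[net] - min_d[net] for net in max_d}
    peak = max(tw.values())
    return {net: (value / peak if peak > 0 else 0.0) for net, value in tw.items()}


# Hypothetical circuit: g1 reaches o2 directly and o1 through g2.
GATES = [("g1", ["a", "b"]), ("g2", ["g1", "c"]), ("o1", ["g2", "a"]), ("o2", ["g1"])]
DELAY = {"g1": 1.0, "g2": 2.0, "o1": 1.0, "o2": 1.0}
TW = tw_observability(GATES, ["o1", "o2"], DELAY)
```

Here net a has the most unbalanced paths to the outputs (delays 1 and 4), so it receives the largest normalized value, while g2, which has a single path, receives 0.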
TW observability can be computed recursively by computing the maximum and minimum $PathToPO_{ij}$ from the sink (primary outputs) to the source (primary inputs) while visiting nodes in the reverse topological order. The maximum and minimum $PathToPO_{ij}$ at the gate outputs connected directly to the primary outputs are set to 0. The gate delay is added while going from the gate output to the gate input. When a stem is encountered, the maximum (minimum) of the max (min) of $PathToPO_{ij}$ at the branches is computed. Thus, this pessimistic metric assigns higher values of TW observability to nodes which have different path delays to the primary outputs. Intuitively this makes sense, as the radiation-induced noise pulse can occur in a wider time window and still get captured in the output flops, making the node more vulnerable. This metric is also normalized by dividing the TW observability at a gate output by the maximum TW observability found in the circuit.

Finally, a cumulative probability of observability (CPO) for each gate output is computed, which captures all three masking effects cumulatively. The CPO of gate i at its output can thus be expressed as,

$CPO_i = LogicalObserv_i \times ElectricalObserv_i \times TwObserv_i$ (3.10)

It should be noted that while the logical observability has higher values for gates near the primary outputs, the TW observability of these gates is quite low. The internal gates which are farther away from the primary outputs have more unbalanced delay paths to the primary outputs and hence have higher values of TW observability. However, for these nodes, the logical observability is quite low.

3.2.4 Estimation of Timing Slack

A combinational circuit without feedback can be modeled as a directed acyclic graph (DAG). The DAG can be made polar by assigning a dummy source node connected to all primary inputs and a sink node connected to all primary outputs.
The earliest arrival time (EAT) of each net can now be computed by traversing the DAG in the topologically sorted order from the source and assigning the EAT of a gate output as the maximum of the EATs of its inputs plus the delay of the gate. Similarly, the latest arrival time (LAT) of each net can be computed by traversing the DAG in the topologically sorted order from the sink and assigning the LAT of a gate input as the minimum of the LATs of its outputs minus the delay of the gate. The difference of the LAT and the EAT provides the slack for each net.

CHAPTER 4
RADIATION IMMUNITY AT PHYSICAL DESIGN LEVEL

The rates of radiation-induced soft errors have been significantly increasing due to the aggressive scaling trends in the nanometer regime. Several circuit optimization techniques have been proposed in the literature for preventing such transient faults in logic circuits. These include concurrent error detection circuits on selected nodes, selective gate sizing, and dual-VDD assignment. As described in Chapter 5, selective node hardening at the transistor level can also be used. In this chapter we show that transient glitches due to radiation strikes can be sufficiently reduced by intelligently modifying the placement stage in cell-based designs to selectively assign larger wirelengths to certain critical nets. Larger netlengths can provide larger RC ladders to effectively filter out the transient glitches. Towards this, we propose two placement algorithms based on (i) simulated annealing and (ii) quadratic programming that significantly reduce the soft error rates of logic circuits. The soft error masking effects are captured by using a new metric called the cumulative probability of observability (CPO) of each net, which is defined as the probability that a transient glitch at the net will result in a soft error for the logic circuit.
The cost function for simulated annealing (SA) is modeled as the summation of the masking probability weighted with the netlength for each net, while simultaneously constraining the total area and the total wirelength. The quadratic programming based placement algorithm for radiation immunity provides a more computationally efficient alternative for soft error reduction during placement. Both algorithms try to assign higher wirelengths to nets with low masking probability for higher glitch reduction, while maintaining a low delay and area penalty for the overall circuit. To the best of our knowledge, the reduction of soft error rate during placement is being attempted for the first time. The proposed algorithms were implemented using the FreePDK 45nm Process Design Kit and the OSU standard cell library and tested on the ISCAS85 benchmark circuits. Experimental results indicate that the proposed algorithms significantly improve the radiation immunity of logic circuits without much delay and area overhead.

Figure 4.1 Interconnects Modeled as RC Ladder

The rest of the chapter is organized as follows. In Section 4.1, we explain how interconnect length can be an effective way to reduce the propagation of transients due to radiation. Section 4.2 describes the SA based placement algorithm to reduce circuit SER. Section 4.3 provides a faster alternative implementation of radiation immune placement using quadratic programming. Finally, Section 4.4 describes our experimental setup and illustrates the results.

4.1 Glitch Filtering in Interconnects

In this section, we show how the interconnect length can be leveraged to filter out glitches resulting from random radiation strikes. Let us consider the case in which a simple inverter cell is driving a fixed fanout load.
The wire connecting the driving inverter cell to the driven fanout load can be approximated as an RC ladder as shown in Figure 4.1. The RC ladder is modeled using four resistance/capacitance elements in series, with each block to the right modeling an increase in interconnect length. For example, the effective RC ladder at node n1 models 1X times the interconnect length and the effective RC ladder at node n2 models 2X times the interconnect length. Similarly, the effective RC ladders at nodes n3 and n4 model interconnect lengths of 3X and 4X respectively.

Figure 4.2 Effect of Wirelengths on Glitch Reduction (HSPICE transient waveforms of v(n1), v(n2), v(n3) and v(n4); voltage versus time)

A radiation strike on a cell can be modeled as a transient current source. We modeled radiation strikes with deposited charges in the range of [60fC, 135fC] with current sources as defined in equation 1, with a $t_a$ of 10ps and a $t_b$ of 5ps. The range of deposited charges was chosen based on the typical radiation flux at sea level [29]. We experimented with varying the interconnect lengths for the circuit in Figure 4.1 in SPICE to estimate the reduction in glitches due to random radiation strikes. We used the FreePDK 45nm technology kit and the unit resistance and capacitance available from the technology library. The results of our experiments are shown in Figure 4.2. As shown in the figure, a greater interconnect length acts as a higher order RC ladder and effectively has a higher low-pass filtering capacity. The transient response also shifts progressively towards the right due to the higher RC delay incurred. We note that the glitch reduction is more pronounced when the interconnect length is increased from 1X to 2X, or 2X to 3X, than from 3X to 4X.
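The low-pass behaviour of the ladder in Figure 4.1 can be reproduced qualitatively with a very small numerical model. The sketch below is not the SPICE setup used in this work: it assumes unit resistance and capacitance per segment, replaces the radiation current source with an ideal rectangular voltage glitch at the ladder input, ignores the fanout load, and integrates the node equations with a forward-Euler step. All parameter values are hypothetical.

```python
def ladder_peaks(stages=4, r=1.0, c=1.0, pulse_width=1.0, dt=0.01, t_end=40.0):
    """Peak voltage at nodes n1..n_stages of a uniform RC ladder.

    The ladder is driven by a unit rectangular voltage glitch of the given
    width (an assumption; the source text uses a transient current source).
    """
    v = [0.0] * stages
    peaks = [0.0] * stages
    for step in range(int(t_end / dt)):
        src = 1.0 if step * dt < pulse_width else 0.0
        nxt = v[:]
        for i in range(stages):
            left = src if i == 0 else v[i - 1]
            i_in = (left - v[i]) / r                      # current from the left
            i_out = (v[i] - v[i + 1]) / r if i + 1 < stages else 0.0
            nxt[i] = v[i] + dt * (i_in - i_out) / c       # forward-Euler update
        v = nxt
        peaks = [max(p, x) for p, x in zip(peaks, v)]
    return peaks


PEAKS = ladder_peaks()
```

In this simplified model the glitch peak seen at a node shrinks as the node moves further down the ladder, which is the filtering effect exploited in the rest of the chapter.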
Thus, increasing the interconnect length beyond a certain threshold does not greatly reduce the circuit SER but does worsen the circuit delay. Also, as discussed in Chapter 3, different signal nets have different CPO values and therefore increasing the interconnect lengths for nets having low CPO only reduces the circuit performance but does not effectively reduce the circuit SER. Based on this observation, in the subsequent sections we describe placement algorithms that selectively increase the interconnect length for soft error critical nets while not impacting the delay critical nets.

4.2 Placement for Radiation Immunity using Simulated Annealing

Block placement algorithms which are based on sequence pairs use SA, and efficient algorithms have been proposed in the literature to compute the unique placement configuration from finite sequence pairs by computing the longest common subsequence [54] of the sequence pair. In this section, we describe how the SA based placement algorithm can be used to generate a radiation immune placement of standard cells while simultaneously constraining the total area and wirelength. Given a sequence pair, a vertical and a horizontal constraint graph can be obtained by applying the left-of, right-of, above-of and below-of relations on the sequence pair [49]. The weighted constraint graphs can be made polar by assigning a dummy source and sink node. The longest path from the source to any of the nodes in the horizontal and vertical constraint graphs gives, respectively, the horizontal and vertical coordinates of the block in the corresponding placement. The longest path from the source to the sink gives the width and height of the placement and hence the resulting area. The wirelength (WL) of a signal net can be computed by taking the semiperimeter of the bounding box enclosing the blocks that are connected by the net.
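The mapping from a sequence pair to block coordinates and the semiperimeter wirelength can be illustrated with a simple quadratic-time sketch. It is not the longest-common-subsequence algorithm of [54]; it evaluates the same left-of and below-of relations directly. The function names, the block sizes and the use of block centres for the bounding box are assumptions made for this example.

```python
def place_sequence_pair(pos_seq, neg_seq, size):
    """Lower-left block coordinates implied by a sequence pair (sketch).

    size: {block: (width, height)}. Block a is left of b if a precedes b in
    both sequences; a is below b if a follows b in pos_seq and precedes b in
    neg_seq.
    """
    pos_index = {b: i for i, b in enumerate(pos_seq)}
    x, y = {}, {}
    for k, b in enumerate(neg_seq):
        x[b] = y[b] = 0.0
        for a in neg_seq[:k]:                      # a precedes b in neg_seq
            if pos_index[a] < pos_index[b]:        # a is left of b
                x[b] = max(x[b], x[a] + size[a][0])
            else:                                  # a is below b
                y[b] = max(y[b], y[a] + size[a][1])
    width = max(x[b] + size[b][0] for b in neg_seq)
    height = max(y[b] + size[b][1] for b in neg_seq)
    return x, y, width, height


def semiperimeter(net, x, y, size):
    """Half-perimeter of the bounding box of the block centres on a net."""
    cx = [x[b] + size[b][0] / 2.0 for b in net]
    cy = [y[b] + size[b][1] / 2.0 for b in net]
    return (max(cx) - min(cx)) + (max(cy) - min(cy))


SIZE = {"a": (2, 1), "b": (1, 2), "c": (1, 1)}
X, Y, WIDTH, HEIGHT = place_sequence_pair(["a", "b", "c"], ["b", "a", "c"], SIZE)
WL_AC = semiperimeter(["a", "c"], X, Y, SIZE)
```

For this pair, block b sits at the origin, a is stacked above b and c is placed to the right of both, giving a 3 by 3 bounding box.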
We note that the wirelength can also be estimated more accurately using the spanning tree method or the Steiner tree method. However, the semiperimeter metric is fast and closely approximates the more accurate Steiner tree method, especially for nets with small fanout.

Simulated annealing is used in placement as an iterative improvement algorithm. Each placement configuration is represented as a sequence pair, and moves in the space of sequence pairs are probabilistically accepted depending upon the cost gradient and the current temperature. Given an initial sequence pair, we allow the following types of moves: (a) exchange of 2 block positions in the first sequence alone and (b) exchange of 2 block positions in both the sequences. Higher cost moves have a higher probability of acceptance at initial iterations for better state space exploration, while at later iterations the algorithm greedily tries to minimize the cost. Traditionally, the cost function is simply the bounding box area or the total wirelength. We have modified the cost function so that it maximizes the CPO weighted wirelength while simultaneously constraining the total area and the total wirelength. At higher temperatures, the radiation immune placement algorithm that we have developed minimizes the following cost function,

$CostFunc1 = 0.5 \times \frac{Area_{Total}}{AREA_{OPT}} + 0.5 \times \frac{WL_{Total}}{WL_{OPT}}$ (4.1)

where $AREA_{OPT}$ is the optimal cost obtained by the placement algorithm for minimizing just the bounding box area, $WL_{OPT}$ is the optimal cost obtained by the placement algorithm for minimizing just the total wirelength, and $Area_{Total}$ and $WL_{Total}$ are the bounding box area and total wirelength for the current placement configuration.
Thus, initially the cost function is just a normalized and weighted combination of area and wirelength. At lower temperatures, the cost function is changed as follows,

$CostFunc2 = -\sum_{i \in Nets} CPO_i \times WL_i + e^{(Area_{Total} - AREA_{OPT})/M} + e^{(WL_{Total} - WL_{OPT})/N}$ (4.2)

where $CPO_i$ and $WL_i$ are the CPO value and wirelength for the net i in the current placement configuration, while Nets is the set of all signal nets in the circuit. M and N are user defined constants controlling the penalty for high area and wirelength and are set to values of 50 and 140 respectively after extensive experimentation. As shown in Figure 4.3, the cost increases exponentially if the area and total wirelength are too high compared to the optimal area and total wirelength costs obtained during placement for just area or wirelength. Maximizing the CPO weighted wirelength selectively improves the SER for soft error critical nets while not affecting the delay critical nets.

Figure 4.3 Cost Function with Penalty for High Area and Total Wirelength. (Note: All values are in generic units)

The overall SA based radiation immune placement algorithm is shown in Algorithm 1. CostFunc1 and CostFunc2 are the same as defined in equations 4.1 and 4.2.
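The two-phase annealing loop of Algorithm 1 can be sketched as follows. This is a simplified illustration rather than the implementation evaluated in this chapter: the cost functions and the perturbation are passed in as callables, the number of moves per temperature is kept fixed instead of being rescaled, the default initial temperature is far smaller than the one used in the experiments, and the toy problem at the end (sorting a permutation) is purely hypothetical.

```python
import math
import random


def anneal(initial, perturb, cost_high, cost_low, init_temp=100.0,
           freezing_temp=0.1, low_temp_threshold=4.0, cooling_rate=0.9,
           moves_per_temp=100, rng=None):
    """Simulated annealing with a cost switch at a temperature threshold."""
    rng = rng or random.Random(0)
    place, temp = initial, init_temp
    while temp > freezing_temp:
        # First cost function at high temperature, second one afterwards.
        cost = cost_high if temp > low_temp_threshold else cost_low
        for _ in range(moves_per_temp):
            candidate = perturb(place, rng)
            delta = cost(candidate) - cost(place)
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / temp).
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                place = candidate
        temp *= cooling_rate
    return place


def swap_two(order, rng):
    """Hypothetical move: exchange two positions, returning a new list."""
    i, j = rng.sample(range(len(order)), 2)
    nxt = list(order)
    nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def displacement(order):
    """Toy cost: total distance of each element from its sorted position."""
    return sum(abs(v - i) for i, v in enumerate(order))


START = [5, 4, 3, 2, 1, 0]
# A real placer would pass two different cost functions here.
RESULT = anneal(START, swap_two, displacement, displacement)
```

Passing the cost functions as parameters keeps the schedule independent of how area, wirelength and CPO are combined.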
The algorithm assumes that $AREA_{OPT}$ and $WL_{OPT}$ have been previously obtained by using the cost function of just the area or just the total wirelength.

Algorithm 1 Radiation Immune Placement Using SA
  temp = INIT_TEMP; place = INIT_PLACEMENT; AnnealTime = INIT_ANNEAL_TIME;
  while temp > FREEZING_TEMP do
    CurAnnealTime = 0;
    while CurAnnealTime < AnnealTime do
      newplace = PERTURB(place);
      if temp > LOW_TEMP_THRESHOLD then
        ΔC = CostFunc1(newplace) − CostFunc1(place);
      else
        ΔC = CostFunc2(newplace) − CostFunc2(place);
      end if
      if ΔC < 0 then
        place = newplace;
      else if RANDOM(0,1) < e^(−ΔC/temp) then
        place = newplace;
      end if
      CurAnnealTime = CurAnnealTime + 1;
    end while
    AnnealTime = AnnealTime × ANNEAL_RATE;
    temp = temp × COOLING_RATE;
  end while

The pin assignment to I/O terminals is done by first dividing the placement area into three horizontal regions. If most of the centers of the blocks connected to the I/O pin lay in the topmost horizontal region, the I/O pin was assigned a position at the center of the top edge of the placement boundary. Similarly, if most block centers lay in the bottommost horizontal region, the I/O pin was assigned a position at the center of the bottom edge of the placement. Otherwise, the middle horizontal region was further divided into left and right regions. If most block centers lay in the left vertical region, the I/O pin was assigned a position at the center of the left edge; otherwise the I/O pin was assigned a position at the center of the right edge.

4.3 Fast SER Aware Placement using Quadratic Programming

As shown in the previous section, SA based placement can be tuned to SER optimization by appropriately choosing the cost function used. However, SA based placement has a large computational overhead, especially when it involves optimization of wirelength or CPO weighted wirelength. The execution time can be slightly improved using the notion of floorplan slacks as proposed in [62].
Thus, while the SA based approach provides good results in terms of SER reduction, due to its large runtimes we investigate a computationally efficient placement algorithm for SER optimization based on Quadratic Programming (QP). The objective function for the placement problem can be formulated as a weighted sum of the squared distances among the connected cells and can be expressed as,

$f(x, y) = \frac{1}{2} \sum_{i \in N} \sum_{j \in N} w_{ij} c_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$ (4.3)

where N denotes the set of modules to be placed, $c_{ij}$ denotes the connectivity between modules i and j, $x_i$ and $y_i$ denote the coordinates of the center of module i, and $w_{ij}$ denotes the user defined weight for the connection between modules i and j. Traditionally, for timing driven placement, the weights $w_{ij}$ are chosen to be a function of the criticality of the corresponding net joining modules i and j. To incorporate an SER driven placement scheme we modified the weight function as,

$w_{ij} = w_1 (1 - slack_{ij}) + w_2 (1 - CPO_{ij})$ (4.4)

where $CPO_{ij}$ is the CPO of the net connecting modules i and j, while $slack_{ij}$ is the corresponding timing slack at that net. $w_1$ and $w_2$ are user defined constants in the range [0, 1] controlling the relative weighting for delay and SER optimization respectively. The objective function f(x, y) given in equation 4.3 is a separable function and can be rewritten as,

$f(x, y) = f(x) + f(y)$ (4.5)

which makes the analysis of f(x) and f(y) identical. The function f(x) can be expressed in a compact matrix form as,

$f(x) = \frac{1}{2} x^T Q x + d^T x + constant$ (4.6)

where x is a vector denoting the x coordinates of the cell locations, Q is the positive definite connectivity matrix weighted with SER and delay metrics as in equation 4.4, while the vector d originates from the contribution of the I/O pad cells, which can be treated as fixed modules. The allowable placement regions for a set of modules are updated after each iteration of bipartitioning.
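To make the role of the connection weights concrete, the one-dimensional unconstrained problem can be solved directly by setting the gradient of the quadratic objective to zero. The sketch below is only an illustration: it uses a small dense Gaussian elimination instead of the sparse solvers used in this work, and the cell names, pad positions and weights are hypothetical.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting (dense, small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def quadratic_place_1d(cells, edges, fixed):
    """Minimize the weighted sum of squared distances along one axis.

    cells: movable cell names; edges: (u, v, weight) two-pin connections;
    fixed: {pad: coordinate} for fixed I/O pads.
    """
    idx = {c: i for i, c in enumerate(cells)}
    n = len(cells)
    q = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for u, v, w in edges:
        for p, s in ((u, v), (v, u)):
            if p in idx:
                q[idx[p]][idx[p]] += w            # diagonal of the Q matrix
                if s in idx:
                    q[idx[p]][idx[s]] -= w        # movable neighbour
                else:
                    rhs[idx[p]] += w * fixed[s]   # fixed pad contribution
    return dict(zip(cells, solve_linear(q, rhs)))


PADS = {"L": 0.0, "R": 12.0}
UNIFORM = quadratic_place_1d(["a", "b"], [("L", "a", 1.0), ("a", "b", 1.0), ("b", "R", 1.0)], PADS)
RELAXED = quadratic_place_1d(["a", "b"], [("L", "a", 1.0), ("a", "b", 0.25), ("b", "R", 1.0)], PADS)
```

With uniform weights the two cells are spread evenly between the pads (a at 4, b at 8). Lowering the weight of the net between a and b to 0.25 stretches that net from 4 to 8 units, which is how a lower weight on a soft error critical net translates into a longer wire.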
The centers of the placement regions at the t-th level of partitioning are given by

$A^{(t)} x = b^{(t)}$ (4.7)

where the vector $b^{(t)}$ denotes the center coordinates of the placement regions at the corresponding iteration step t, and the entries $a_{cr}$ of the matrix $A^{(t)}$ are computed as follows,

$a_{cr} = \begin{cases} \dfrac{area_c}{\sum_{c' \in R_r^{(t)}} area_{c'}} & \text{if } c \in R_r^{(t)} \\ 0 & \text{otherwise} \end{cases}$

where $area_c$ is the area of a cell c and $R_r^{(t)}$ is a partitioned region r of the placement region R in the partition iteration t. Since, by construction, matrix Q is positive definite and the constraints in the form of equation 4.7 are linear, the overall placement problem is a quadratic optimization problem which is convex and has a unique global minimum $f(x^*)$.

Algorithm 2 Radiation Immune Placement Using QP
  R = whole chip area.
  Compute the CPO of all nets.
  Compute the slack of all nets.
  Compute the weight function for all nets (equation 4.4).
  Compute the Q matrix using the weighted connectivity matrix.
  Solve the initial unconstrained QP for x.
  Solve the initial unconstrained QP for y.
  while each cell is not assigned a region do
    Alternate between sorting cells using x or y coordinates.
    Bipartition the placement region R into R^(t) using sorted coordinates.
    Construct the constraint matrix A and vector b using the bipartition.
    Solve CQOP for x.
    Solve CQOP for y.
  end while
  Legalize the final placement.

The overall algorithm for QP based placement is given in Algorithm 2. The algorithm progresses by alternating global optimization and partitioning phases. However, unlike other partitioning based placers, the algorithm maintains simultaneity across all optimization steps [63]. The QP based placement scheme is based upon solving a series of constrained quadratic optimization problems (CQOP).
The algorithm initially solves the global optimization problem by imposing one constraint on all modules, forcing the centroid of the cells to the chip center. The solution of this provides the initial spatial coordinates of the cell locations. These spatial coordinates are then sorted based on x or y coordinates and the placement region is partitioned into two regions.

Figure 4.4 Placement of c432 Benchmark Using QP (A) SER Optimized, w1=0.1 w2=0.9, (B) Delay Optimized, w1=0.9 w2=0.1

Figure 4.5 Placement of c432 Benchmark Using QP (A) SER Optimized, w1=0.01 w2=0.99, (B) Delay Optimized, w1=0.99 w2=0.01

We performed the bipartitioning by alternating between sorting based on x coordinates and sorting based on y coordinates on each iteration. The x and y coordinates obtained after solving the CQOP in the previous step are used to do the bipartitioning of the next step. The cells belonging to each of the two regions are used to compute the centroids of the two new regions, and the centers of these two new regions are then used to impose new constraints. Subsequently, the next global optimization step is performed by solving a CQOP with these new constraints for all regions.
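One bipartitioning step and the resulting constraint rows can be sketched as follows. The sketch assumes that the sorted cell list is split at its midpoint and that each constraint row holds area-weighted entries (the area of a cell divided by the total cell area of its region), so that the product of a row with the coordinate vector is the area-weighted centre of that region. The function names and the cell data are hypothetical.

```python
def bipartition(cells, coord):
    """Split cells into two regions using their sorted coordinates."""
    order = sorted(cells, key=lambda c: coord[c])
    half = len(order) // 2
    return order[:half], order[half:]


def constraint_rows(cells, regions, area):
    """One row per region: area_c / (total area of the region), else 0."""
    rows = []
    for region in regions:
        members = set(region)
        total = sum(area[c] for c in region)
        rows.append([area[c] / total if c in members else 0.0 for c in cells])
    return rows


CELLS = ["a", "b", "c", "d"]
X = {"a": 1.0, "b": 7.0, "c": 3.0, "d": 9.0}
AREA = {"a": 1.0, "b": 1.0, "c": 3.0, "d": 1.0}

LEFT, RIGHT = bipartition(CELLS, X)
A = constraint_rows(CELLS, [LEFT, RIGHT], AREA)
# Area-weighted centre of each region for the current coordinates.
CENTRES = [sum(a_cr * X[c] for a_cr, c in zip(row, CELLS)) for row in A]
```

In the next optimization step these rows would be paired with the desired region centres to form the linear equality constraints of the CQOP.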
This alternating global optimization and partitioning step is carried out until each cell is assigned to its own region. As the CQOP formulation for placement considers all cells as point masses, a final legalization step is necessary to remove minor overlaps.

4.4 Experimental Results

The proposed algorithms were implemented on a 1.5 GHz UltraSparc processor with 4 GB of memory running SunOS 5.8. The results were validated on the ISCAS'85 benchmark circuits using the FreePDK 45nm technology kit and the OSU standard cell library built on it [58]. The dimensions of each cell were extracted from the DEF file of the standard cell library. Synopsys Design Compiler was used to do the initial technology mapping and for computing the enabling probability of the nets. The CPO for each net was calculated by a separate C script using the method discussed in Chapter 3. The technology mapped structural netlist is converted into the GSRC bookshelf format using a converter C script. Many soft error estimation tools have been reported in the literature [26, 29, 33]. The SEAT-LA tool [29] models the entire spectrum of neutron strikes (with charge values in the [10fC, 150fC] range) and is quite close in accuracy to actual SPICE simulations. We extended this path-based tool to a node-based formulation as described in [56] for our SER estimation.

The various controlling parameters used for the SA based placement algorithm are summarized in Table 4.1. The proposed algorithm reads the blocks and the netlist in the GSRC format. In Figure 4.6, we plot the absolute areas for SER optimized placement, area optimized placement and wirelength optimized placement. We also plot the absolute wirelengths for the three placement schemes in Figure 4.7.
As shown, the total area and the wirelength were increased marginally for the SER optimized placement scheme compared to the area optimized and wirelength optimized placements respectively.

Table 4.1 Simulated Annealing Parameters

  Parameter Name        Value
  INIT_TEMP             5 million
  FREEZING_TEMP         0.1
  INIT_ANNEAL_TIME      100
  INIT_PLACEMENT        Identity permutation on sequence pair
  LOW_TEMP_THRESHOLD    4.0
  COOLING_RATE          90%
  ANNEAL_RATE           2%

Table 4.2 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.99 and 0.01

  Benchmark   % SER reduction   % Delay Overhead   % WL Overhead
  c432        53.79             25.09              19.02
  c499        30.54             3.47               0.83
  c880        37.88             28.22              16.21
  c1355       41.97             4.61               1.45
  c1908       27.83             8.34               3.96
  c2670       101.0             4.53               1.74
  c3540       77.82             3.57               3.66
  c5315       37.65             12.63              10.01
  c6288       57.46             15.85              7.34
  c7552       11.92             0.1                0.24
  AVG         47.79             10.62              5.21

Figure 4.6 Area Comparison for Different Placement Schemes. (Note: All values are in generic units)

Figure 4.7 Wirelength Comparison for Different Placement Schemes. (Note: All values are in generic units)

In Table 4.3, we illustrate the results of SER reduction on the ISCAS85 benchmarks for our radiation immune SA based placement. As shown in the table, our SER optimized placement reduces the SER by 27.12% on average while incurring an area overhead of only 18.86% compared to area optimized placement. Compared to wirelength optimized placement, the radiation immune placement scheme reduces the SER by 72.29% with a delay overhead of only 9.26%. For some benchmarks, like the c2670 benchmark, an SER reduction of nearly 95% was achieved. As shown, the radiation immune SA based placement scheme in general selectively increases the netlengths of soft error critical nets while keeping the total area and total wirelength in check.

The QP based placement formulation was implemented in C.
We used the GNU Scientific Library [64] for solving the initial unconstrained QP problem. The CQOP problems are solved using a quadratic programming solver package [65]. The QP based placement formulation requires pin assignment for I/O pins, which was done by doing an initial standard cell placement using Cadence SoC Encounter. The DEF file with the pin locations is converted to GSRC format using Capo [66]. The radiation immune QP placer reads the GSRC format files containing the pin locations and outputs the placed coordinates for all movable cells, again in GSRC format. Capo is used to legalize this placement solution and create the final placement.

We performed timing driven QP based placement by using higher weights for $w_1$ and lower weights for $w_2$ in equation 4.4. Weights are assigned values in the range [0, 1]. For placement skewed towards SER optimization we used higher weights for $w_2$ and lower weights for $w_1$. A weight of 1.0 for $w_1$ and 0.0 for $w_2$, or vice versa, makes the matrix Q in equation 4.6 singular. So we achieved timing driven placement in the limit by providing weights of 0.9, 0.99 and 0.9999 to $w_1$ and 0.1, 0.01 and 0.0001 to $w_2$. Similarly, we approached SER aware placement by providing weights of 0.9, 0.99 and 0.9999 to $w_2$ and 0.1, 0.01 and 0.0001 to $w_1$. Figures 4.4, 4.5 and 4.8 show placements of the c432 benchmark for various values of $w_1$ and $w_2$.
Table 4.3 SER Reduction for Radiation Immune SA Based Placement with Associated Delay and Area Overhead

  Comparison of radiation immune placement with:
              Area optimized placement             Wirelength optimized placement
  Benchmark   % SER reduction   % Area Overhead    % SER reduction   % Delay Overhead
  c17         20.93             5.26               71.26             2.0
  c432        20.83             23.83              67.70             10.72
  c499        23.89             30.77              64.28             9.45
  c880        13.12             14.57              69.75             10.13
  c1355       30.09             31.01              76.15             9.66
  c1908       29.81             25.27              88.43             12.07
  c2670       49.36             19.59              94.72             9.91
  c3540       20.39             11.17              59.79             10.91
  c5315       35.90             8.26               62.0              9.7
  c6288       20.79             3.70               67.67             9.2
  c7552       33.26             11.33              73.53             8.14
  AVG         27.12             18.86              72.29             9.26

Figure 4.8 Placement of c432 Benchmark Using QP (A) SER Optimized, w1=0.0001 w2=0.9999, (B) Delay Optimized, w1=0.9999 w2=0.0001

We computed the SER reduction and delay overhead for SER aware placement as the relative decrease in SER and increase in delay with respect to the timing driven QP based placement with a similar weight combination. In Tables 4.4, 4.2 and 4.5, we summarize the results of SER reduction and the corresponding delay and wirelength overhead on the ISCAS85 benchmarks for these weight combinations. For example, Table 4.4 compares SER reduction, delay and WL overhead by assigning $w_1$ of 0.9 and $w_2$ of 0.1 for timing driven placement and by assigning $w_1$ of 0.1 and $w_2$ of 0.9 for SER aware placement. Similarly, in Table 4.2 we compare a weight combination of $w_1$ of 0.99, $w_2$ of 0.01 with $w_1$ of 0.01, $w_2$ of 0.99, while in Table 4.5 we compare a weight combination of $w_1$ of 0.9999, $w_2$ of 0.0001 with $w_1$ of 0.0001, $w_2$ of 0.9999. As shown, the average SER reduction saturates at about 48% for more skewed weight combinations, while the delay and WL overheads are about 11% and 5% respectively. Large reductions in SER can be achieved for c2670 and c3540. This can be explained by the large spread of CPO values in these benchmarks. c7552 shows a smaller reduction in SER as the distribution of CPO values in this benchmark is very tight. Overall, the radiation immune QP based placement scheme consistently reduces SER by selectively increasing the netlengths of soft error critical nets while keeping the delay and total wirelength in check. It should be noted that although increasing interconnect lengths for soft error critical nets with timing slack does not worsen circuit performance, overall interconnect power may increase.
Although power has not been considered in either of our placement schemes, it can easily be incorporated into our placement frameworks by suitably changing the cost function used during optimization.

We compared the proposed simulated annealing (SA) based radiation immune placement with the quadratic programming (QP) based radiation immune placement. Overall, the QP based scheme lost about 13-14% of SER reduction compared to the SA based radiation immune placement. However, there was an orders of magnitude difference in runtime between the SA based and the QP based radiation immune placement schemes. In Figure 4.9, we plot the speedup in runtime of the QP based placement for radiation immunity compared to the SA based placement for radiation immunity. The experiments were performed with a varied number of cells in the design by using a subset of the ISCAS85 benchmarks and certain large benchmarks from the OpenSparc T1 designs [61]. As shown in the figure, on average there can be a speedup of more than 100X. Thus, QP based radiation immune placement offers a good compromise: somewhat lower solution quality in exchange for much better runtime. The SA based radiation immune placement provides better SER reduction, but for large designs the scheme is prohibitive in terms of runtime.

Table 4.4 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9 and 0.1

Benchmark    % SER reduction   % Delay Overhead   % WL Overhead
c432         38.41             18.09              14.31
c499         25.03             2.91               0.39
c880         24.17             19.0               11.37
c1355        31.30             4.29               0.0
c1908        27.84             9.57               5.92
c2670        78.16             4.32               2.27
c3540        59.65             2.83               0.90
c5315        25.76             8.73               7.35
c6288        48.11             12.38              5.61
c7552        10.21             1.06               0.34
AVG          36.00             6.90               3.32
The primary reason for the superior runtimes of QP based placers is that the iterative solution methods used for solving the CQOP exploit the sparsity of the Q matrix efficiently. Furthermore, as the number of constraints in the CQOP increases, the solution of the previous iteration can be used to guide the solution of the next iteration [63]. Therefore, the number of iterations required to solve the CQOP decreases rapidly. It should also be noted that the QP based placement scheme could easily be modified into a force directed placement scheme. However, we found that such force directed placers improved execution time only marginally while producing many cell overlaps, which puts considerable pressure on the placement legalizer.

Table 4.5 Comparison of SER Aware QP Based Placement with Timing Driven Placement: Weight Combination 0.9999 and 0.0001

Benchmark    % SER reduction   % Delay Overhead   % WL Overhead
c432         54.35             25.12              18.40
c499         30.71             3.18               1.23
c880         38.75             27.67              12.95
c1355        42.87             4.57               1.71
c1908        23.83             7.76               3.27
c2670        103.1             4.95               2.08
c3540        75.41             2.93               4.21
c5315        43.43             21.09              19.39
c6288        55.73             12.56              2.54
c7552        11.84             0.0                0.21
AVG          48.01             10.91              5.12

Figure 4.9 Speedup Comparison of QP Based and SA Based Radiation Immune Placement Schemes

CHAPTER 5
SOFT ERROR MITIGATION AT CIRCUIT LEVEL

A soft error may manifest itself as a bit flip in a memory element, or it can occur at any internal node of combinational logic and subsequently propagate to and be captured in a latch. In the past, soft errors have been handled at the circuit level by using Schmitt triggers, by adding duplicate cells and clamping the voltage between the duplicate nodes, and by adding pass transistors to filter the random glitches that lead to soft errors.
However, these approaches for avoiding soft errors in logic circuits often incur significant overheads in terms of delay, area and power. In this chapter, we propose a novel circuit which can be inserted at the output of a logic cell to prevent the generation of transient glitches due to radiation strikes. The circuit is based on an RC differentiator that detects the occurrence of such transient glitches. The large voltage swing across the resistance of the differentiator is used to control the gate to body voltage of depletion type NMOS and PMOS devices placed in series. Thus, the very characteristic of transient pulses is exploited to cut off the cell hit by the strike from the driven cell. Experimental results indicate that the insertion of these radiation blocking cells on the gate output nodes can significantly reduce the generation of transient glitches. However, blind insertion of these cells on circuit nodes can incur delay and area penalties. Based on this observation, we propose an algorithm to insert the cells only on selected nodes in a logic circuit. The algorithm ranks the circuit nodes by a new metric called the Probability of Radiation-blocker circuit Insertion (PRI) and inserts the radiation blocking cells only on the top few nodes in the sorted list of PRI values. The PRI metric is computed by taking a weighted combination of the glitch masking effect on a circuit node and the slack available at that node. Thus, the algorithm inserts radiation blocking cells selectively on soft error vulnerable nodes on the noncritical paths of a circuit.
We experimented with the framework using the NCSU 45nm Process Design Kit and the Nangate standard cell library. Experimental results on ISCAS85 benchmark circuits indicate that logic circuits optimized with selective insertion of these radiation blocking cells can significantly reduce soft error rates with marginal overheads in terms of delay and area.

The rest of the chapter is organized as follows. In Section 5.1, we present the transistor level circuit that can reduce the propagation of transients that are generated due to radiation. In Section 5.2, we describe an algorithm to insert the glitch blocking cells on selected circuit nodes so that the delay and area costs remain very low. Section 5.3 describes the experimental setup and presents the results. Finally, we compare with some related works in Section 5.4.

5.1 Radiation Induced Glitch Blocker Circuit

In this section, we describe the proposed circuit level technique for countering transient faults in a standard cell based design flow. The technique is based on an RC differentiator circuit to counter transient glitches due to radiation strikes occurring on the active area of the cells. The circuitry, which is attached to a standard cell output, will be referred to as the radiation blocker circuit throughout this chapter. The transistor level schematic of the radiation blocker circuit is shown in Figure 5.1. As shown in the figure, the circuit consists of an RC differentiator implemented with MOS transistors M1, M2 and M3. A small, always on, NMOS transistor M1 provides enough resistance to obtain a large voltage swing during radiation strikes.

Figure 5.1 Schematic of Radiation Induced Glitch Blocker Circuit
The small resistance can also be implemented using a simple poly strip, eliminating the need for M1. The NMOS and PMOS transistors M2 and M3 provide a good constant capacitance value across a voltage range. The use of this configuration is motivated by the idea that the current that flows through M1 is proportional to the rate of change of the voltage on node n1. In particular, the current flowing through M1, acting as a resistor, is given by

    I(t) = C_{eff} \frac{dV_{n1}}{dt}    (5.1)

where C_eff is the effective capacitance due to the NMOS and PMOS transistors M2 and M3. The voltage swing across M1 is proportional to this current. As shown in the figure, we denote the voltages, with respect to ground, on the drain and source of M1 as Vr and Vr_bar respectively. During a radiation strike, the voltage across M1 is very high. This voltage is used to control the gate-body voltage of the depletion type NMOS and PMOS transistors M4 and M5. As M4 and M5 are depletion mode devices, they are normally on, and a negative voltage has to be applied to make them go into cutoff. During a regular logic transition, the voltage on node n1 changes in a comparatively slower ramp and therefore the voltage across M1 is small. Thus, during a regular logic transition, the voltage across M1 is not enough to cut off M4 or M5. However, during a single event transient due to a radiation strike, the change in voltage of node n1 is exponential, which leads to a large voltage drop across M1. This voltage drop is sufficient to cut off the depletion mode transistors M4 or M5.

Figure 5.2 Plotting the Voltages across M1

Figure 5.3 (A) Transient Pulses on Inverter Cell for Radiation Strikes of Varying Strength, (B) Corresponding Results on an Inverter Cell Protected with Radiation Blocker Circuit
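The contrast between a slow logic transition and a fast radiation induced transient can be illustrated numerically. The following is a minimal sketch, not the simulated circuit: it treats the differentiator as an ideal series capacitor and resistor, drives it with linear ramps of different rise times, and integrates with forward Euler. The component values (10 kOhm, 1 fF), the 1 V swing and the two rise times are hypothetical and chosen only for illustration.

```python
def resistor_peak_voltage(rise_time_s, r_ohm=10e3, c_farad=1e-15,
                          v_swing=1.0, dt=1e-13):
    """Peak voltage across R of a series C-R differentiator for a linear ramp.

    Integrates dV_R/dt = dV_in/dt - V_R / (R * C) with forward Euler.
    All default values are illustrative assumptions, not extracted data.
    """
    tau = r_ohm * c_farad              # RC time constant
    slope = v_swing / rise_time_s      # dV_in/dt while the input is ramping
    steps = int(round(3 * rise_time_s / dt))
    v_r = 0.0
    peak = 0.0
    for k in range(steps):
        t = k * dt
        dv_in = slope * dt if t < rise_time_s else 0.0
        v_r += dv_in - (v_r / tau) * dt
        peak = max(peak, v_r)
    return peak


# Hypothetical rise times: a regular logic transition and a strike transient.
slow_peak = resistor_peak_voltage(100e-12)
fast_peak = resistor_peak_voltage(5e-12)
print(f"slow ramp peak: {slow_peak:.3f} V, fast transient peak: {fast_peak:.3f} V")
```

Under these assumed values the fast edge produces a peak several times larger than the slow ramp, which is the property the blocker circuit relies on to cut off M4 or M5 only during a strike.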
In Figure 5.2, we plot the transient node voltages denoted by Vr and Vr_bar. It should be noted that during the rising phase of the radiation induced transient pulse, the voltage swing is positive and the depletion mode PMOS device M5 cuts off, while if the voltage swing is negative, the depletion mode NMOS device M4 cuts off. It should also be noted that a small positive or negative voltage appears across M1 during regular logic transitions as well. Since the magnitude of this voltage is small, owing to the comparatively slower voltage ramp of the output node during regular logic transitions, it does not cut off the depletion mode transistors M5 or M4 respectively.

We experimented with the radiation blocker circuit using the FreePDK 45nm Process Design Kit on a simple inverter cell hit by a radiation strike. Figure 5.3(A) illustrates the transient glitches generated due to radiation strikes of varying strength on the inverter cell. Figure 5.3(B) shows the corresponding results for the inverter cell with radiation blocker circuitry, which shows a significant reduction in the transient pulses generated due to radiation strikes. The passgate could always be replaced by a transmission gate to avoid a threshold drop. This is especially relevant as supply voltage is reduced due to scaling trends, which decreases noise margins. However, since this requires the addition of two more transistors, we chose to stick with the passgate solution in order to reduce the area overhead.

The radiation blocker circuit, though effective in reducing transients due to radiation strikes, has an impact on area and delay when attached to the output of a standard cell. A generalized approach to reduce the area overhead is to use more compound/complex gate realizations. A compound gate is formed by the combination of series and parallel MOS structures with complementary pullup and pulldown logic.
As these gates are built using the static CMOS style, they are called static CMOS complex gates (SCCG) [82]. The limitation of SCCG gates is that if the number of transistors in series exceeds an upper limit in any path of the pullup or pulldown logic, there is an adverse effect on the propagation delay of the gate. Typically this upper bound can be safely fixed to three or even four transistors. Thus, complex logic gates can be provided as input during the technology mapping phase and the nodes in the corresponding circuit can be protected with the radiation blocker circuit. We used a mix of simple standard cells such as inverter, nand2 and nor2 with SCCG standard cells such as AOI222 and OAI222. Although simple cells were required to enable technology mapping to cover all types of logic functions, a large number of SCCG gates could be found in the mapped netlist. This technique does reduce the overheads of our proposed approach; however, if we protect all logic gates with the radiation blocker circuit, the area and delay overhead of the entire circuit can still be significant. The area and delay overheads for various protected standard cells are shown in Table 5.1, which shows that the area overhead of a cell can be more than 60%. Based on this observation, we developed an algorithm that exploits the asymmetric distribution of masking probability to optimize only selected nodes of a logic circuit. Next, we present metrics to estimate the various soft error masking effects on the circuit nodes.

5.2 Selective Insertion Algorithm

As discussed in Section 5.1, the SER savings from using the radiation blocker circuit may be nullified by the overheads in delay and area. The overheads for protecting standard cells with the radiation blocker circuit may be reduced by enforcing the use of SCCG gates. However, blind protection by the use of the radiation blocker on all gate output nodes will result in significant overheads.
In this section, we propose an algorithm for selective insertion of radiation blocker circuits on cell outputs to provide a reduction in circuit SER with very low performance and area overheads. The CPO metric developed in the previous section is leveraged to selectively optimize vulnerable circuit nodes. Thus, the asymmetric distribution of SER masking probability is used to provide high SER savings for the circuit while only marginally impacting delay and area.

Table 5.1 Overhead of Adding Radiation Blocker Circuit for Various Standard Cells

Cell Name    % Area Overhead   % Delay Overhead
INVX16       74.14             3.4
NAND2X4      74.14             3.5
NOR2X4       74.14             3.2
AOI21X4      52.95             2.6
AOI211X4     41.19             3.4
AOI22X4      41.19             2.3
AOI221X4     30.89             3.4
AOI222X4     26.48             2.7
OAI33X1      52.95             2.2
OAI21X4      52.95             3.2
OAI211X4     41.19             3.1
OAI22X4      37.07             2.4
OAI221X4     30.89             3.0
OAI222X4     26.48             2.2

A combinational circuit without feedback can be modeled as a directed acyclic graph (DAG). The DAG can be made polar by adding a dummy source node connected to all primary inputs and a sink node connected to all primary outputs. The earliest arrival time (EAT) of each net can now be computed by traversing the DAG in topologically sorted order from the source and assigning the EAT of a gate output as the maximum of the EATs of its inputs plus the delay of the gate. Similarly, the latest arrival time (LAT) of each net can be computed by traversing the DAG in topologically sorted order from the sink and assigning the LAT of a gate input as the minimum of the LATs of its outputs minus the delay of the gate. The difference between the LAT and the EAT gives the slack for each net. The probability for radiation blocker circuit insertion (PRI) of each gate output is now computed by taking a weighted combination of the slack and the CPO at each cell output net.
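The arrival time traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: delays are attached to gates rather than nets, the required time at the primary outputs is taken to be the largest EAT, and the function name `compute_slack` and the five-gate example are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def compute_slack(fanin, delay):
    """Return (eat, lat, slack) dictionaries for a combinational DAG.

    fanin maps each gate to the list of gates driving it;
    delay maps each gate to its propagation delay.
    """
    order = list(TopologicalSorter(fanin).static_order())  # drivers first

    # Forward pass: EAT = max EAT over the inputs + delay of the gate.
    eat = {}
    for g in order:
        eat[g] = delay[g] + max((eat[p] for p in fanin.get(g, ())), default=0.0)

    required = max(eat.values())  # assumed required time at the outputs

    fanout = {g: [] for g in order}
    for g, drivers in fanin.items():
        for p in drivers:
            fanout[p].append(g)

    # Backward pass: LAT = min over fanout gates of (their LAT - their delay).
    lat = {}
    for g in reversed(order):
        lat[g] = min((lat[s] - delay[s] for s in fanout[g]), default=required)

    slack = {g: lat[g] - eat[g] for g in order}
    return eat, lat, slack


# Hypothetical five-gate example; g5 drives the only primary output.
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g2"], "g5": ["g3", "g4"]}
delay = {"g1": 1, "g2": 2, "g3": 3, "g4": 1, "g5": 1}
eat, lat, slack = compute_slack(fanin, delay)
print(slack)
```

In this example the gates on the longest path (g2, g3, g5) have zero slack, while g1 and g4 have positive slack and are therefore candidates for protection without a delay penalty.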
Thus the PRI at the output of a gate i can be expressed as

    PRI_i = W_{SER} \cdot CPO_i + W_{slack} \cdot slack_i    (5.2)

where CPO_i and slack_i indicate the CPO and the slack at the corresponding gate output, while W_SER and W_slack are user defined weights to trade off SER optimization against the corresponding delay overhead. A higher PRI value at a gate output indicates that the corresponding net has a high slack and is highly susceptible to soft error upsets at the registers or primary outputs due to radiation strikes on the active area of the gate. Thus, protecting selected circuit nodes having higher PRI values ensures that radiation blocker cells protect nodes that are on noncritical paths but are highly susceptible to soft errors. We select a set of M% of the gate nodes, G_M, by sorting the various gate output nets based on their PRI values.

Algorithm 3 Steps for the Proposed Selective SER Optimization Using Radiation Blocker Cells
(i) Perform technology mapping to a structural netlist.
(ii) Read in the structural netlist as a graph and create a polar DAG with source and sink nodes.
(iii) Estimate the logical observability of all the nets using the signal probability at the primary inputs.
(iv) Populate the load caps at each internal node and compute the electrical observability of all internal nets.
(v) Populate node delays using the gate type and the load cap and compute the timing window observability for each net.
(vi) Compute the CPO of each net.
(vii) Perform a topological sort from the source node and calculate the EAT of each net.
(viii) Perform a topological sort from the sink node and calculate the LAT of each net.
(ix) Compute the slack of each net.
(x) Compute the PRI value of each net by taking the weighted combination of the CPO and the slack (Equation 5.2).
(xi) Select the topmost M% of the gates based on PRI values (G_M).
(xii) Insert radiation blocker cells at the outputs of the gates selected in G_M.
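The PRI ranking and the selection of the top M% of gates can be sketched as below. The sketch reads Equation 5.2 as a weighted sum, which is an interpretation of the text's "weighted combination"; it further assumes that CPO and slack have already been normalized to comparable ranges and that the gate count is rounded up. The function name and the four-gate data are hypothetical; the default weights 0.9 and 0.1 are the values used for Table 5.2.

```python
import math


def select_gates_for_blocker(cpo, slack, w_ser=0.9, w_slack=0.1, m_percent=10.0):
    """Rank gate outputs by PRI and return the top m_percent of them.

    cpo and slack map gate names to (assumed normalized) CPO and slack values.
    PRI is taken as w_ser * CPO + w_slack * slack (weighted-sum assumption).
    """
    pri = {g: w_ser * cpo[g] + w_slack * slack[g] for g in cpo}
    ranked = sorted(pri, key=pri.get, reverse=True)          # highest PRI first
    count = math.ceil(len(ranked) * m_percent / 100.0)       # rounding up is an assumption
    return ranked[:count], pri


# Hypothetical normalized values for four gate outputs.
cpo = {"a": 0.8, "b": 0.1, "c": 0.5, "d": 0.3}
slack = {"a": 0.2, "b": 1.0, "c": 0.9, "d": 0.0}
selected, pri = select_gates_for_blocker(cpo, slack, m_percent=50.0)
print(selected)
```

With W_SER much larger than W_slack, the ranking is driven mainly by susceptibility (CPO), and slack acts as a secondary criterion that favors nodes off the critical path.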
Choosing various values of M can be used to trade off SER reduction against area and delay overhead. We have experimented with different values of M. We provide detailed results of the tradeoff between SER reduction and area and delay overhead for varying W_SER, W_slack and M in the experimental results section. The proposed algorithm for the selective insertion of radiation blocker circuits for SER optimization is summarized in Algorithm 3. As shown, the algorithm starts with an initial technology mapped netlist and then computes the CPO and slack values for each net. The PRI values are then computed for each gate output. The top M% of the gates based on PRI values are selected for insertion of the radiation blocker circuit. The computational complexity of the algorithm (not considering technology mapping) is dominated by the topological sort, which is used for the computation of CPO (steps iii and v in Algorithm 3) and of the slack at the gate outputs (steps vii and viii in Algorithm 3).

Figure 5.4 Simulation Flow: SER Reduction by Using Radiation Blocker Circuits
The computational complexity of topological sort depends on running DFS on the circuit graph and is roughly O(n^2), where n is the number of gates in the circuit. Steps ii, iv, vi, ix and x are linear time and hence O(n), while step xi is constant time (O(1)). Thus the overall computational complexity of the algorithm is quadratic in the number of gates.

5.3 Experimental Results

The proposed algorithm was implemented on a 1.5 GHz UltraSparc processor with 4 GB of memory running SunOS 5.8. The results were validated using the ISCAS'85 benchmark circuits. We experimented with our proposed approach using the FreePDK 45nm Process Design Kit [31] and the Nangate standard cell library [34] based on the 45nm technology. Synopsys Design Compiler was used to do the initial technology mapping and to compute the enabling probability of the nets. We modeled radiation strikes with deposited charges in the range [60fC, 135fC] using current sources as defined in equation 1, with a t_a of 10ps and a t_b of 5ps as in [39]. The range of deposited charges was chosen based on the typical radiation flux at sea level [28]. The layout of the radiation blocker circuitry was created in Cadence Virtuoso (shown in Figure 5.5), and the netlists with extracted parasitics were then simulated in SPICE for the original standard cell and for the standard cell protected with the radiation blocker circuitry at its output node. We estimated the delay and area overheads of adding the radiation blocker circuit to each standard cell using the SPICE simulations and the layouts. Many soft error estimation tools have been reported in the literature [26, 29, 33, 59]. The ASSA methodology [59] yields results quite close in accuracy to actual SPICE simulations and is significantly faster than other tools for SER estimation. We have implemented a version of this tool in-house for our SER estimation.

Figure 5.5 Layout of the Radiation Blocker Circuitry
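The strike model mentioned above refers to "equation 1", which is not reproduced in this chapter. As a rough illustration only, the sketch below assumes the double-exponential current pulse that is commonly used for single event transients, I(t) = Q / (t_a - t_b) * (exp(-t / t_a) - exp(-t / t_b)); whether this matches equation 1 exactly is an assumption. The time constants and the charge range endpoints are taken from the text; the function names are hypothetical.

```python
import math


def strike_current(t, q_coulomb, t_a=10e-12, t_b=5e-12):
    """Assumed double-exponential strike current (amperes) at time t (seconds)."""
    if t < 0:
        return 0.0
    return q_coulomb / (t_a - t_b) * (math.exp(-t / t_a) - math.exp(-t / t_b))


def deposited_charge(q_coulomb, t_end=200e-12, dt=1e-14):
    """Numerically integrate the pulse to check that it deposits q_coulomb."""
    steps = int(round(t_end / dt))
    return sum(strike_current(k * dt, q_coulomb) * dt for k in range(steps))


# Endpoints of the charge range used in the experiments: 60 fC and 135 fC.
for q in (60e-15, 135e-15):
    print(f"Q = {q * 1e15:.0f} fC, integrated charge = {deposited_charge(q) * 1e15:.2f} fC")
```

The integral of this pulse over time equals Q, so sweeping Q between 60 fC and 135 fC scales the pulse amplitude while leaving its shape fixed by t_a and t_b.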
The overall simulation flow is given in Figure 5.4.

Table 5.2 Experimental Results for ISCAS'85 Benchmark Circuits

         % Reduction in SER        % Delay Overhead          % Area Overhead
Ckt      M=1%    M=5%    M=10%     M=1%    M=5%    M=10%     M=1%    M=5%    M=10%
c432     6.20    18.76   32.54     0.18    0.37    0.54      2.72    9.53    19.07
c499     4.95    15.41   28.85     0.00    0.05    0.23      2.33    9.35    18.70
c880     17.74   44.46   61.76     0.07    0.20    0.43      2.18    9.48    18.97
c1355    4.07    13.67   25.33     0.00    0.06    0.06      2.39    9.56    19.13
c1908    14.25   36.70   53.00     0.09    0.09    0.28      2.31    9.84    19.11
c2670    18.39   64.91   76.27     0.07    0.07    0.32      2.17    9.54    18.66
c3540    16.71   39.49   54.01     0.00    0.02    0.05      2.13    9.43    18.56
c5315    18.81   43.36   59.42     0.00    0.00    0.00      1.93    9.45    18.72
c6288    30.54   50.33   62.43     0.00    0.00    0.00      1.90    9.33    18.58
c7552    14.14   41.58   59.38     0.02    0.33    0.42      1.94    9.35    18.57
AVG      14.58   36.87   51.30     0.04    0.12    0.23      2.20    9.49    18.81

In Table 5.2, the results for the ISCAS85 benchmarks are shown for the reduction in SER along with the corresponding overheads in delay and area. The results are reported for varying values of M with W_SER and W_slack fixed at 0.9 and 0.1 respectively. The SER reduction was calculated as the decrease in SER of the selectively protected circuit relative to the original circuit, divided by the original SER. Similarly, the delay (area) overhead was calculated as the increase in delay (area) of the selectively protected circuit relative to the original circuit, divided by the original delay (area). c2670, c6288 and c880 did very well in reducing SER using our proposed approach. For c6288 in particular, even when only 1% of the cells were selected for insertion of the radiation blocker circuit, more than 30% reduction in SER was achieved at no delay penalty and an area overhead of only 1.9%. c432 showed the highest sensitivity of delay overhead to increasing M, while the increase in area overhead was more or less similar across the various benchmarks.
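The relative metrics defined above are simple ratios, and the AVG row can be checked directly from the per-circuit entries. The short sketch below uses hypothetical helper names; the list holds the M=10% SER reduction column of Table 5.2.

```python
def percent_reduction(original, protected):
    """Decrease relative to the original value, in percent."""
    return 100.0 * (original - protected) / original


def percent_overhead(original, protected):
    """Increase relative to the original value, in percent."""
    return 100.0 * (protected - original) / original


# % SER reduction at M=10% for the ten ISCAS'85 circuits in Table 5.2.
ser_reduction_m10 = [32.54, 28.85, 61.76, 25.33, 53.00,
                     76.27, 54.01, 59.42, 62.43, 59.38]
average_m10 = sum(ser_reduction_m10) / len(ser_reduction_m10)
print(f"average SER reduction at M=10%: {average_m10:.2f}%")
```

The computed mean agrees with the 51.30% reported in the AVG row of the table.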
As shown in Table 5.2, on average, our proposed approach can achieve an SER reduction of as much as 51% with an area overhead of about 19% and a delay overhead of only about 0.2%. As power scales very well with circuit area, we believe that the power overhead of our proposed approach will also be quite low.

We also experimented with varying the weights that give relative importance to SER and slack (W_SER and W_slack) during the computation of PRI values, at various values of M. The results of these comparisons are shown in Figures 5.6-5.8. As shown in Figure 5.6, the rate of SER reduction increases significantly when higher values are given to W_SER, while, as shown in Figure 5.7, the increase in delay overhead is only marginal. The effect is especially pronounced at higher values of M. As shown in Figure 5.8, changing the weights W_SER and W_slack at a fixed value of M does not affect area overhead, which is expected.

Figure 5.6 Comparison of SER Reduction for Different User Defined Parameters

5.4 Comparison with Related Works

We understand that comparing our work to other circuit level works for improving SER is not straightforward, since the base simulation platforms for the methods are quite different. For example, the shadow gates technique [39] uses 65 nm BPTM technology for its simulations, while the Schmitt trigger based technique in [38] uses 0.35 um technology libraries for its experiments. We therefore provide a qualitative comparison of our work with other circuit level techniques for node hardening.

The shadow gates with diode clamper based technique incurs a low delay overhead for hardening circuit nodes [39]. However, the duplication of entire cells leads to high area overheads. This is especially true for complex standard cells with many transistors. The area overhead of our proposed approach, on the other hand, is independent of the type of standard cell.
In fact, as we have shown, the SCCG gates using a radiation blocker circuit incur relatively low area overheads. Also, due to process variations, duplicate gates in the shadowing technique may not have exactly the same delay as the original gate. This in turn may affect the performance of the hardened standard cell. Our approach does not suffer from such a limitation.

Figure 5.7 Comparison of Delay Overhead for Different User Defined Parameters

Complementary pass gates can act as a low pass filter for glitches induced by radiation strikes [37]. However, the method can only eliminate transient pulses with low or moderate magnitudes. High amplitude pulses are attenuated but are not completely eliminated. Hence, protection can be achieved only against a subset of radiation induced SETs. Otherwise, large sized pass gates or a chain of pass gates need to be used. This can make the method expensive in terms of delay for the realistic radiation flux found at sea level.

The Schmitt trigger based technique [38] uses explicit feedback of stored charge to fight the transient charges injected during a radiation strike. This idea has also been used in [36] in the context of dynamic gates or latches and in [40] for static CMOS circuits. However, due to technology scaling, very little charge is stored in the feedback node. This seriously impacts the glitch reduction capability of these circuits at scaled technology nodes, precisely where the soft error problem becomes an even bigger challenge. In contrast, our approach uses the characteristic of the radiation induced transient itself to detect the occurrence of a radiation strike and cuts off the cell hit by the strike from providing input drive to the driven cell.

We also note that many works exist on selective sizing of the gates of a circuit [84], [42], [45] and on simultaneous sizing and flip-flop selection [30] for SER reduction.
However, we felt that it was not fair to compare these logic level sizing approaches to our approach, which predominantly depends on circuit level hardening. We strongly feel that such sizing approaches can be applied over and above our proposed approach to further reduce the circuit SER.

Figure 5.8 Comparison of Area Overhead for Different User Defined Parameters

CHAPTER 6
LOGIC LEVEL RELIABILITY CENTRIC GATE SIZING

The trends in technology scaling, the shrinking device dimensions, the environmental noise factors and the uncertainty due to process variations have significantly impacted the reliability and yield of integrated circuits in the nanometer regime. The transient faults, also known as soft errors, induced by particles arising from radiation strikes can occur not only in memory elements but also in the internal nodes of combinational and sequential logic, from which they can propagate to other nodes, posing a significant threat to the signal integrity of circuits. Further, crosstalk noise due to the cross coupling capacitance among wires placed in close proximity is another major challenge towards achieving high signal quality. The presence of uncertainty due to process variations makes it difficult to analyze and estimate noise in circuits. In this work, we investigate a new approach for reliability centric gate sizing in which the objective is the simultaneous optimization of both the soft error rate (SER) and the crosstalk noise, besides the power and performance of circuits, while considering the effect of process variations. In the proposed approach, the soft error rate for a gate is modeled as a first order function of the gate size and the sizes of the gates in its transitive fanin.
The glitch masking effects are accurately captured by using two new metrics called the glitch enabling probability (GEP) and the cumulative probability of observability (CPO), defined based on the signal probabilities of the nets. The crosstalk induced noise is modeled at the logic level based on the clustering of the structural netlists using Rent's rule. While the clustering algorithm iterates until the difference in Rent's exponent values between any pair of clusters falls below a user defined threshold, the crosstalk noise is optimized by minimizing the pairwise differences in sizes of all cells within a cluster. Further, maximizing the variance in the gate sizes results in maximizing the available slack, which in turn minimizes the delay uncertainty due to process variations, thus improving the timing yield. The first order modeling of SER and the crosstalk noise at the logic level reduces the number of decision choices, thus reducing the search space and resulting in an efficient optimization algorithm. The resulting gate sizing problem is formulated as a nonlinear mathematical program which is solved using the interior point method. Experimental results on ISCAS'85 benchmark circuits indicate significant improvement in SER, crosstalk noise, power and timing yield compared to the corresponding constrained optimization formulations.

6.1 Interaction of Various Noise Sources under Process Uncertainty

In the nanometer regime, the interconnect geometries in integrated circuits have been aggressively scaled down. While this has reduced the self capacitance of the wires, the coupling capacitance between the wires has become a challenge to the reliability of the systems. Two closely spaced nets are each treated as victim or aggressor with respect to the other net.
A victim net is often adversely affected by the transitions on the aggressor net due to the coupling capacitance between them, which may lead to functional failure (wrong logic computation) or to timing failure (longer circuit delay). If the victim switches in the same direction as the aggressor, the signal transition in the victim is hastened, leading to hold time violations, while if the victim switches in the opposite direction to the aggressor, the signal transition is delayed, leading to setup time violations and timing failure. If the victim line is at steady state and switching in the aggressor induces a signal higher (lower) than its high (low) logic level, a bootstrap noise results. In general, crosstalk noise depends primarily on the coupling capacitance between the interconnects, the spacing between the wires, and the sizes of the victim and the aggressor gates.

In this section, we illustrate how the effects of crosstalk noise and soft errors are compounded by their simultaneous presence in a circuit. Consider the occurrence of a radiation induced glitch on the output of gate G4 in Figure 6.1. The glitch may be sufficiently separated in time from the arrival of the clock edge. However, the induced crosstalk due to C_z causes a spurious clock pulse and the glitch is latched. Note that G4 may have a high masking probability due to its large timing window masking factor, but a soft error still results due to the coupled capacitive effect. Again, consider the case in which, due to logic transitions on aggressors G1 and G5, a crosstalk noise appears on victim G3 through coupling capacitors C_x and C_y. The amplitude of such crosstalk noise may be too small to cause functional failure.

Figure 6.1 Interaction of Soft Errors and Crosstalk Noise
The glitch will be intensified, however, if a radiation strike occurs on gate G3 at the same time, ultimately leading to a functional failure. Thus, it can be seen that although a single noise source may not affect the functionality of a circuit, the simultaneous presence of various noise sources could intensify the effects of each other. As the vulnerability of the system is more severely affected in the presence of multiple noise sources, the analysis and optimization of circuits considering the effect of a single noise source could be inefficient and pessimistic.

Next, we illustrate how deterministic noise avoidance techniques are rendered ineffective in the presence of extreme process variations in current designs. A popular strategy for soft error detection is based on cost-effective partial duplication [85]. However, as shown in Figure 6.2, the delay of the duplicated logic block may be quite different from that of the original logic due to process variations. Thus, if the delays of the original and the duplicated blocks are significantly separated in time, the above approach is no longer applicable for error detection. Further, the signal switching delay of a victim net is affected by switching on its coupled aggressor nets. The delay of the signal in the victim net could be larger (smaller) when the aggressors are switching in the opposite (same) direction as the victim. Thus, crosstalk noise can create delay uncertainty. Delay uncertainty also results in the presence of process variations due to uncertainty in gate length, oxide thickness, gate threshold voltage, etc. A unified scheme is thus required which can handle delay uncertainty due to both crosstalk noise and manufacturing variations. The uncertainty in the propagation delay of signals can cause violations of setup and hold timing constraints, resulting in timing failure of the design [72].
Figure 6.2 Soft Error Mitigation under Process Uncertainty

Figure 6.3 First Order Model on Soft Errors of Logic Circuits with Varying Gate Size: (A) Transient Pulse Generation, (B) Transient Pulse Propagation

6.2 Logic Level Modeling of the Design Metrics

In this section, we describe our methodology for modeling, at the logic level, the various metrics for design optimization such as SER, crosstalk noise, power and delay. In Section 6.2.1, we provide some background on soft errors in logic circuits followed by a first-order model for the optimization of the soft error rate. In Section 6.2.2, we present a Rent's rule based clustering method to model crosstalk noise effects at the logic level. Finally, in Section 6.2.3, we describe the power and timing models.

6.2.1 First Order Modeling of Glitch Masking Effects

We have developed a first-order model of the soft error at a circuit node by considering only the effect of the size of the gate and the sizes of the gates in the transitive fanin of the gate. It can be shown that such an approximation leads to negligible error in SER estimation [30]. As shown in Figure 6.3(A), larger gate sizes have an adverse effect in propagating transient pulses. This is due to the fact that such a signal is amplified to a greater extent by a larger gate. Thus, smaller gates are good at filtering out transient pulses due to radiation strikes and hence can effectively reduce circuit SER. However, as shown in Figure 6.3(B), smaller gates have a lower $Q_{crit}$ and are easily vulnerable to transient pulse generation following a radiation strike.
Larger gates have a sufficiently higher amount of stored charge, and hence their inherent inertia prevents the generation of transient pulses. As described in Chapter 3, the CPO and GEP accurately capture the masking phenomenon at the logic level. We select a set of gate nodes, $G_M$, by sorting the various gate output nets by their CPO and selecting the topmost $M$% of the nets. The value of $M$ was empirically selected to be 10%.

6.2.2 Crosstalk Noise Modeling at the Logic Level

Gate (driver and receiver) sizing can be an effective technique to reduce crosstalk noise on the nets. However, if the size of one gate is increased to reduce the crosstalk noise on its output net, the noise induced by it on the neighboring net increases. This makes the aggressor and victim gates interchange roles, thus resulting in a cyclic dependency in sizing order. In general, crosstalk noise may be estimated very easily in the post-routing stage. However, in this case sizing has to be limited to the free/dead space around the cell, which may not yield the best solution. Alternatively, the entire place and route needs to be repeated for the new sizing solution, which in turn may lead to very different crosstalk estimates, and the whole process needs to be iterated until convergence is achieved. To alleviate this problem, we choose to estimate crosstalk noise at the gate level. It should be noted, however, that modeling crosstalk noise at the logic level is a challenging problem. As no layout information is available, the neighbouring aggressor nets of a victim net and the degree of their overlap are unknown. This makes estimation of coupling capacitance at the logic level a very difficult problem, and an efficient heuristic is required to estimate the subsequent placement and routing phases.

Figure 6.4 Modeling Crosstalk Noise using Graph Clustering based on Rent's Exponent Values
We have modeled the effect of crosstalk noise by exploiting Rent's rule, which relates the number of external signal connections to the gate count of a logic block and is given by

\[ T = t\,G^{p} \tag{6.1} \]

where $T$ denotes the number of external connections, $G$ is the gate count of the logic block, $t$ corresponds roughly to the average pin count of each gate and $p$ is the Rent's exponent. The Rent's exponent is often used to derive placement models [102]. We model crosstalk effects by clustering the structural netlist based on Rent's exponent values. The clustering algorithm is iterated until the difference in Rent's exponent values between any pair of clusters falls below a user-defined threshold. Rent's exponent values are computed as in [102] for large clusters and by brute force for small clusters. As shown in Figure 6.4 for an example circuit (c17), we found that a generated cluster is quite accurate in providing an estimate of the local placement around each cell in the placement phase that follows. The nets of the cells of the same cluster are thus highly likely to be coupled together and hence can induce crosstalk noise. The crosstalk noise can then be optimized by minimizing the difference in sizes of all the cells in the same cluster. Our simulations suggest that such a Rent's exponent based clustering approach, along with routing hints, is quite effective in modeling crosstalk noise at the logic level.
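As a minimal sketch of how the exponent in equation 6.1 could be estimated for a cluster, the fragment below fits log T = log t + p log G by least squares over (gate count, external connection count) samples. The helper name `fit_rent` and the sample data are illustrative assumptions; this is not the procedure of [102].

```python
import math

def fit_rent(samples):
    """Least-squares fit of T = t * G**p in log-log space.

    samples: list of (G, T) pairs, where G is a gate count and T the number
    of external connections. Returns (t, p). Hypothetical helper.
    """
    xs = [math.log(g) for g, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                      # slope is the Rent's exponent
    t = math.exp(mean_y - p * mean_x)  # intercept gives the pin-count term
    return t, p

# Synthetic blocks generated with t = 3 and p = 0.6 (made-up values).
blocks = [(g, 3.0 * g ** 0.6) for g in (4, 16, 64, 256)]
t_est, p_est = fit_rent(blocks)
```

Two clusters could then be compared by the absolute difference of their fitted exponents against the user-defined threshold.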
6.2.3 Power and Timing Models

The dynamic power consumption of a gate $i$ is given as

\[ P_i = \frac{1}{2} f V_{dd}^{2} E_i \left( C_i + C_{wire} \right) + P_{sc} \tag{6.2} \]

where $P_i$ is the total dynamic power consumed by gate $i$, $f$ is the clock frequency, $V_{dd}$ is the supply voltage for the gate, $E_i$ is the average switching activity of the gate, $C_i$ is the intrinsic gate capacitance, $C_{wire}$ is the sum of all the interconnects that fan out from gate $i$ and $P_{sc}$ is the short-circuit power. Reducing the size ($S_i$) of the gate reduces the intrinsic gate capacitance of gate $i$, the power consumption and the fanin load capacitances of the gates in the transitive fanout of gate $i$. A nonlinear delay model of a gate $i$ is given by the logical effort model [68, 81],

\[ d_i = a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} \tag{6.3} \]

where $S_i$ refers to the size of gate $i$, $fo(i)$ is the set of gates that fan out from gate $i$, and the constant coefficients $a_i$ and $b_i$ are empirically determined for each type of cell by processing the vendor-specific library file. The delay values for various load values with different drive strengths are extracted from the library file and the nonlinear coefficients are determined by using a curve-fitting program.

6.3 Gate Sizing Formulation

The minimum delay in a circuit is achievable by respecting the static timing analysis constraints and hence solving the unimetric constrained optimization problem shown below:

\[
\begin{aligned}
\min\ & T_{max} \\
\text{s.t.}\ & S_i^{min} \le S_i \le S_i^{max}, \\
& D_g \le T_{out} \quad \forall g \in G, \\
& T_{out} \le T_{max} \quad \forall out \in PO, \\
& D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.4}
\]

where $T_{in}$ and $T_{out}$ are the specified timing targets of the input and output of a gate, $g$ denotes a particular gate in a circuit which belongs to the set of all gates $G$, and $PO$ is the set of all primary outputs. $S_i^{min}$ and $S_i^{max}$ for gate $i$ account for the size restrictions for that gate type in the library. If we only target power minimization, then all the gates can be set to minimum size.
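The power model of equation 6.2, the delay model of equation 6.3 and the arrival-time constraint of equation 6.4 can be sketched as follows. The dictionary layout, the coefficient values and the two-gate example are assumptions made for illustration only, and loads at the primary outputs are ignored.

```python
def dynamic_power(f, vdd, activity, c_gate, c_wire, p_sc=0.0):
    """Equation 6.2: 0.5 * f * Vdd^2 * E_i * (C_i + C_wire) + P_sc."""
    return 0.5 * f * vdd ** 2 * activity * (c_gate + c_wire) + p_sc

def gate_delay(a, b, size, fanout_sizes):
    """Equation 6.3: a_i + b_i * sum(S_j / S_i) over the fanout gates j."""
    return a + b * sum(s_j / size for s_j in fanout_sizes)

def arrival_times(gates, order):
    """Propagate D_g = T_in + d_g through a DAG listed in topological order.

    T_in is the latest arrival among the fanin gates, or 0 for gates that
    are driven only by primary inputs.
    """
    arrival = {}
    for g in order:
        info = gates[g]
        t_in = max((arrival[u] for u in info["fanin"]), default=0.0)
        loads = [gates[v]["size"] for v in info["fanout"]]
        arrival[g] = t_in + gate_delay(info["a"], info["b"], info["size"], loads)
    return arrival

# Hypothetical two-gate chain g1 -> g2 with made-up coefficients and sizes.
gates = {
    "g1": {"a": 1.0, "b": 2.0, "size": 1.0, "fanin": [], "fanout": ["g2"]},
    "g2": {"a": 1.0, "b": 2.0, "size": 2.0, "fanin": ["g1"], "fanout": []},
}
arrival = arrival_times(gates, ["g1", "g2"])
t_max = max(arrival.values())  # quantity minimized in equation 6.4
power_g1 = dynamic_power(f=1e9, vdd=1.0, activity=0.1, c_gate=2e-15, c_wire=3e-15)
```

Increasing the size of g1 in this example lowers its own delay term, while increasing the size of g2 raises the load seen by g1, which is the coupling that makes the sizing problem nonlinear.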
The minimum power in a circuit is obtained by solving the following unimetric constrained optimization problem,

\[
\begin{aligned}
\min\ & \sum_{i} P_i \\
\text{s.t.}\ & S_i^{min} \le S_i \le S_i^{max}, \\
& D_g \le T_{out} \quad \forall g \in G, \\
& D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.5}
\]

where $P_i$ is the same as defined in equation 6.2, while the other notations are the same as defined in equation 6.4.

Using the first-order soft error analysis described in Section 6.2.1, a nonlinear mathematical program for unimetric constrained SER-aware gate sizing can be formulated as shown below:

\[
\begin{aligned}
\max\ & \sum_{k \in G_M} X_k S_k \\
\text{s.t.}\ & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le 1 \quad \forall j \in G_M, \\
& -1 \le X_j \le 1 \quad \forall j \in G_M, \\
& S_i^{min} \le S_i \le S_i^{max}, \\
& D_g \le T_{out} \quad \forall g \in G, \\
\text{with}\ & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.6}
\]

where $fi(i)$ is the set of gates in the transitive fanin of gate $i$, $GEP_{ij}$ is the GEP for a net $i$ for gate $j$, $G_M$ is the selected set of $M$ gates as described previously and $X_j$ are slack variables of the mathematical program for the $M$ selected gates. Our convex program accounts for the first-order soft error analysis by selectively choosing to minimize $S_i$ (by maximizing $-S_i$) or to maximize $S_i$. If $X_j$ happens to be positive to satisfy the first constraint in equation 6.6, simultaneously maximizing $X_j S_j$ and bounding $X_j$ to be in $[-1, 1]$ pushes $X_j$ towards the value 1 and leads the objective function to maximize $S_j$. Similarly, if $X_j$ happens to be negative to satisfy the constraint, simultaneously maximizing $X_j S_j$ and bounding $X_j$ to be in $[-1, 1]$ pushes $X_j$ towards the value $-1$ and leads the objective function to maximize $-S_j$. The first constraint in equation 6.6 captures whether there is a higher probability of a glitch occurring at the gates in the transitive fanin of gate $i$ or on gate $i$ itself.
Thus, simultaneously bounding the slack variables, along with maximizing $X_k \cdot S_k$, forces selective minimization or maximization of the gate sizes depending on where a glitch is more likely to occur, either on the transitive fanin of the gate or on the gate itself.

Further, using the Rent's rule based graph clustering for crosstalk noise modeling described in Section 6.2.2, a mathematical program for optimizing crosstalk noise effects during gate sizing can be represented as

\[
\begin{aligned}
\min\ & \sum_{C} \sum_{i,j \in C} (S_i - S_j)^{2} \quad \forall C \in Clusters_{graph} \\
\text{s.t.}\ & S_i^{min} \le S_i \le S_i^{max}, \\
& D_g \le T_{out} \quad \forall g \in G, \\
\text{with}\ & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.7}
\]

where $Clusters_{graph}$ denotes the set of clusters obtained using hierarchical graph clustering on the structural netlist graph.

We can now provide a multimetric gate sizing scheme for simultaneous optimization of SER, crosstalk noise and power under delay constraints as follows:

\[
\begin{aligned}
\min\ & -c_1 \frac{\sum_{k \in G_M} X_k S_k}{R_u} + c_2 \frac{\sum_{i} P_i}{P_u} + c_3 \frac{\sum_{C} \sum_{i,j \in C} (S_i - S_j)^{2}}{N_u} \\
\text{s.t.}\ & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le 1 \quad \forall j \in G_M, \\
& -1 \le X_j \le 1 \quad \forall j \in G_M, \\
& D_g \le T_{out} \quad \forall g \in G, \\
& D_p \le T_{spec} \quad \forall p \in PO_{nodes}, \\
\text{with}\ & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i}
\end{aligned}
\tag{6.8}
\]

where $PO_{nodes}$ are the gates connected to the primary outputs and $T_{spec}$ is the specified timing target based upon solving equation 6.4, while $P_u$, $R_u$ and $N_u$ are the solutions obtained after solving the unimetric optimization problems given by equations 6.5 to 6.7, and help in normalizing the various metrics in the multimetric objective function. $c_1$, $c_2$ and $c_3$ are user-defined weighting coefficients controlling the optimization of SER, power and crosstalk noise, respectively.

As discussed previously, process parameter variations impact the gate delays of the cells in a design, thus impacting the overall circuit delay and hence affecting the timing yield of the design. We minimize the timing yield loss due to process variations by maximizing the delay variance at each node in the timing graph.
The maximum delay variance is achieved by adding slack variables for each node in the node-based STA formulation and then maximizing the sum of such slack variables over all nodes. Further, our Rent's exponent based clustering formulation to optimize crosstalk noise can also be used effectively to model spatial correlations among process parameters. Nodes belonging to the same cluster are likely to be placed closer together by the placer and hence will have similar delay variance characteristics. Thus, a mathematical program for optimizing timing yield under delay constraints can be stated as follows:

\[
\begin{aligned}
\max\ & \sum_{g} s_g \\
\text{s.t.}\ & S_i^{min} \le S_i \le S_i^{max}, \\
& D_g \le T_{out} \quad \forall g \in G, \\
& T_{out} \le T_{max} \quad \forall out \in PO, \\
& D_p \le T_{spec} \quad \forall p \in PO_{nodes}, \\
& \sum_{i,j \in C} (s_i - s_j)^{2} \le d \quad \forall C \in Clusters_{graph}, \\
& D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} + s_g
\end{aligned}
\tag{6.9}
\]

where $s_g$ is the slack variable of node $g$ and $d$ is a user-defined threshold parameter that models spatial correlation in the delay variance of a gate.

We can now provide a multimetric gate sizing formulation towards simultaneous optimization of power, SER, crosstalk noise and parametric yield under delay constraints as follows:

\[
\begin{aligned}
\min\ & -c_1 \frac{\sum_{k \in G_M} X_k S_k}{R_u} + c_2 \frac{\sum_{i} P_i}{P_u} + c_3 \frac{\sum_{C} \sum_{i,j \in C} (S_i - S_j)^{2}}{N_u} - c_4 \frac{\sum_{g} s_g}{V_u} \\
\text{s.t.}\ & X_j \Big( \sum_{i \in fi(j)} S_i \cdot GEP_{ij} - S_j \Big) \le 1 \quad \forall j \in G_M, \\
& -1 \le X_j \le 1 \quad \forall j \in G_M, \\
& D_g \le T_{out} \quad \forall g \in G, \\
& D_p \le T_{spec} \quad \forall p \in PO_{nodes}, \\
& \sum_{i,j \in C} (s_i - s_j)^{2} \le d \quad \forall C \in Clusters_{graph}, \\
\text{with}\ & D_g = T_{in} + a_i + b_i \sum_{j \in fo(i)} \frac{S_j}{S_i} + s_g
\end{aligned}
\tag{6.10}
\]

where $V_u$ is the normalization factor obtained after solving equation 6.9 and $c_4$ is a user-defined parameter controlling the optimization of timing yield.

Algorithm 4 Steps in the Proposed Gate Sizing Algorithm
(i) Perform technology mapping to structural netlist.
(ii) Read in structural netlist as a graph.
(iii) Estimate the CPO and GEP of all the nets using the enabling probability of the nets.
(iv) Select the M topmost gates based on CPO values.
(v) Solve unimetric delay minimization problem to determine the timing target.
(vi) Solve unimetric power minimization problem.
(vii) Solve unimetric SER maximization problem.
(viii) Solve unimetric crosstalk minimization problem.
(ix) Solve unimetric delay variance maximization problem.
(x) Use the solutions for the above four problems to form a multimetric optimization problem and decide a timing target using the solution of the unimetric delay minimization problem.
(xi) Solve the corresponding problem using KNITRO.
(xii) Discretize the solution into a structural netlist.

The steps of the proposed reliability-centric gate sizing algorithm under process variations are illustrated in Algorithm 4. The algorithm initially solves the unimetric optimization problems for reliability (both for crosstalk noise and SER), power and delay. These solutions are used to formulate the unified optimization problem. The continuous gate sizes are discretized using a nearest neighbor function, as in [78, 81], to produce a sized structural netlist.

6.4 Experimental Results

The proposed gate sizing algorithm was implemented on a 1.5 GHz UltraSPARC processor with 4 GB of memory running SunOS 5.8. The results were validated using the ISCAS'85 benchmark circuits [70]. We used a subset of cells (INV, NOR2, NAND2, XOR2) from the TSMC 90 nm technology library for our simulations. Synopsys Design Compiler was used to do the initial technology mapping and to compute the enabling probability of the nets. The Rent's exponent based clustering was performed by a separate C script. The convex programs were solved using the KNITRO optimization solver [71] available from the NEOS server. The coefficient weights were chosen empirically in our formulation, and were set to equal weights of 0.25 each. Many soft error estimation tools have been reported in literature [29, 33].
We chose the SEAT-LA tool [29] for soft error rate estimation, primarily for its speed and accuracy and also because it models the entire spectrum of neutron strikes (charge values in the [10 fC, 150 fC] range). The tool, however, has a path-based approach, which was impractical for large circuits due to the exponential number of paths. The overall circuit SER was therefore computed by a top-down procedure. We partitioned the larger circuits into subcircuits and the SER was computed for each subcircuit using SEAT-LA. We computed the product of the SER of the subcircuit closer to the primary outputs with the SER of the subcircuit at its fanin multiplied by its GEP. The SER for a top-level circuit was computed as the sum of such products for all subcircuits in its transitive fanin. The entire flow is repeated for several random input vectors to compute an average SER.

[Figure: SER estimation flow. A script generates N random input vectors; responses are generated for each input vector; a cap file is extracted from the structural netlist (2 cap files for 2 different sizing solutions); SEAT-LA is run for each subcircuit; the circuit SER is computed using the top-down approach.]

Table 6.1 Experimental Results on Benchmark Circuits: Timing Yield Calculation at Different Timing Margins

ISCAS'85     Deterministic Worst Case       Proposed Approach
circuit      at 5%    at 15%   at 30%       at 5%    at 15%   at 30%
c17          79.48    82.99    99.91        95.57    99.08    99.91
c432         51.74    87.45    99.1         61.72    92.1     99.32
c499         41.79    84.24    97.81        56.57    88.53    98.9
c880         52.76    87.51    98.38        73.92    93.03    98.41
c1355        42.56    84.56    95.68        62.75    91.31    97.51
c1908        43.04    84.38    97.84        66.42    89.58    98.66
c2670        62.96    81.27    94.19        71.6     84.02    96.84
c3540        47.59    86.96    96.63        54.29    88.91    97.46
c5315        48.33    88.52    99.23        83.26    99.04    99.93
c7552        42.94    85.4     98.08        57.88    95.17    99.63
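The per-circuit yields of Table 6.1 can be averaged per column to summarize the effect of the proposed sizing. The short script below does this arithmetic; the data structure and helper name are assumptions, while the numbers are copied from the table. At the 5% margin the mean yield rises from about 51.3 to about 68.4, a gain of roughly 17 percentage points.

```python
# Timing yield from Table 6.1 as (deterministic worst case, proposed approach),
# each a tuple of yields at 5%, 15% and 30% timing margin.
table_6_1 = {
    "c17":   ((79.48, 82.99, 99.91), (95.57, 99.08, 99.91)),
    "c432":  ((51.74, 87.45, 99.10), (61.72, 92.10, 99.32)),
    "c499":  ((41.79, 84.24, 97.81), (56.57, 88.53, 98.90)),
    "c880":  ((52.76, 87.51, 98.38), (73.92, 93.03, 98.41)),
    "c1355": ((42.56, 84.56, 95.68), (62.75, 91.31, 97.51)),
    "c1908": ((43.04, 84.38, 97.84), (66.42, 89.58, 98.66)),
    "c2670": ((62.96, 81.27, 94.19), (71.60, 84.02, 96.84)),
    "c3540": ((47.59, 86.96, 96.63), (54.29, 88.91, 97.46)),
    "c5315": ((48.33, 88.52, 99.23), (83.26, 99.04, 99.93)),
    "c7552": ((42.94, 85.40, 98.08), (57.88, 95.17, 99.63)),
}

def mean_yield(approach, column):
    """Mean yield over all circuits; approach 0 = deterministic, 1 = proposed."""
    rows = [values[approach][column] for values in table_6_1.values()]
    return sum(rows) / len(rows)

deterministic_at_5 = mean_yield(0, 0)
proposed_at_5 = mean_yield(1, 0)
gain_at_5 = proposed_at_5 - deterministic_at_5  # in percentage points
```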
Crosstalk noise was estimated by placing and routing the structural netlist in Cadence Encounter and then using Cadence Fire'n'Ice to extract the parasitic resistance and capacitance values in SPEF format. Fire'n'Ice also provides, in the SPEF file, the lengths of the nets coupled to a particular net. The top few aggressor nets were identified for each net using a C script. Subsequently, the average crosstalk noise was estimated as discussed in [69, 82] with a separate C script. We estimated the timing yield of the netlist using an in-house implementation of an SSTA tool, which is based on propagating discrete probability distributions through the structural netlist as described in [73]. The variances of the individual gate distributions were modified to model spatial correlations using the information from the placement tool.

Figure 6.6 Average Timing Yield at Different Timing Margins

We estimated the timing yield for a deterministic delay-optimized gate sizing formulation with worst-case corner values and for a gate sizing approach as provided in equation 8 at different timing targets. The timing targets were decided by adding a $T$% margin to the critical delay obtained using traditional gate sizing for delay optimization. The timing yield for various values of the timing margin $T$ is shown in Table 6.1. As shown in Figure 6.6, our gate sizing methodology considering process variations can improve the timing yield of designs by more than 15% for designs with timing margins of less than 5%. We illustrate the results of SER reduction on the ISCAS'85 benchmarks for single-metric SER optimization and for the multimetric optimization technique considering simultaneous optimization of SER, delay and power. As shown in Figure 6.7, for single-metric SER optimization, the average SER reduction is 52% over unimetric delay optimization and 25% over unimetric power optimization.
For multimetric optimization, the average SER reduction is 45% over unimetric delay optimization and 11% over unimetric power optimization. We compute the delay overhead of the single-metric SER optimization and the multimetric optimization as the percentage increase in normalized delay compared to unimetric delay optimization.

Figure 6.7 SER Reduction for ISCAS'85 Benchmarks

We also estimated the percentage reduction in dynamic power, crosstalk noise and SER for the multimetric gate sizing formulation considering process variation, compared with the corresponding constrained optimization with a single objective in which the other metrics were constrained to the values obtained using the multimetric sizing scheme. For example, for comparing power reduction, we constrained the SER, crosstalk noise and parametric yield to be the same as obtained in the multimetric optimization and formulated the power optimization problem under these constraints. As shown in Figure 6.8, on average the multimetric optimization methodology showed about 39% improvement in dynamic power reduction. For comparing SER reduction, we constrained the power, crosstalk noise and parametric yield to be the same as obtained in the multimetric optimization and formulated the SER optimization problem under these constraints. As shown in Figure 6.8, on average the multimetric optimization methodology showed about 21% improvement in soft error rates.

Figure 6.8 Improvement in SER, Crosstalk Noise and Power

We also compared the reduction in average crosstalk noise by constraining the power, SER and parametric yield to be the same as obtained in the multimetric optimization and formulating the crosstalk optimization problem under these constraints. As shown in Figure 6.8, on average the multimetric optimization methodology showed about 26% improvement in average crosstalk noise.
We also noted that the methodology, when tested on larger benchmarks, failed due to memory and size limitations of the KNITRO solver. This is inherently a limitation of the solver used and not of the proposed approach. Further, this is not severe because of the increasing trend towards shallower logic depths in integrated circuits of the nanometer regime due to high clock frequency requirements.

CHAPTER 7
SOFT ERROR TOLERANCE AT ARCHITECTURAL LEVEL

With the continuous decrease in the minimum feature size and increase in the chip density due to technology scaling, on-chip L2 caches are becoming increasingly susceptible to multibit soft errors. The increase in multibit errors could lead to a higher risk of data corruption and potentially result in the crashing of application programs. Traditionally, L2 caches have been protected from soft errors using techniques such as (i) error detection/correction codes, (ii) physical interleaving of cache bit lines to convert multibit errors into single-bit errors and (iii) cache scrubbing. While the first two methods incur large area overheads for multibit errors, identifying the time interval for scrubbing could be tricky. In this chapter, we investigate in detail the multibit soft error rates in large L2 caches and propose a framework of solutions for their correction based on the amount of redundancy present in the memory hierarchy. We investigate several new techniques for reducing multibit errors in large L2 caches, in which the multibit errors are detected using simple error detection codes and corrected using the data redundancy in the memory hierarchy. We also propose several techniques to control/mine the redundancy in the memory hierarchy to further improve the reliability of the L2 cache.
The proposed techniques were implemented in the SimpleScalar framework and validated using the SPEC 2000 integer and floating point benchmarks for L2 cache vulnerability, global cache miss rate, average cycle count and main memory write-back rate, considering the area and power overheads. Experimental results indicate that the vulnerability of L2 caches can be decreased by 40% on average for integer benchmarks and 32% on average for floating point benchmarks, with an average multibit error coverage of about 96%, with significantly less area and power overheads and with virtually no performance penalty.

Figure 7.1 Vulnerability of Different Cache Organizations for SPECINT2000

The rest of the chapter is organized as follows. In Section 7.1, we model the vulnerability of the L2 caches due to multibit errors using a probabilistic formulation characterized by extensive simulations for multibit errors in various L2 cache organizations. In Section 7.2, we present several schemes to reduce the vulnerability of the L2 cache based on exploiting the redundancy present in the memory hierarchy. In Section 7.3, we present schemes to control the redundancy for reducing the vulnerability of the L2 cache. Section 7.4 details the experimental methodology and the results. Finally, Section 7.5 compares our redundancy-based multibit error protection framework with several recent works in literature.

7.1 Characterization of Multibit Errors in Conventional Caches

In this section, we provide a characterization of the multibit error rate in conventional caches. In particular, we are interested in characterizing how the multibit error rate changes with cache size, associativity and cache line size. We assume that error detection codes (EDC) like CRC or Hamming distance are maintained, which require much less area overhead than error detection and correction codes like ECC. Multibit errors in the dirty cache lines of the L2 caches can be detected using these EDC codes. However, unlike in clean cache lines, the multibit errors in the dirty cache lines cannot be corrected, as no duplicate of the correct data is maintained. We therefore define the vulnerability of the L2 cache as the percentage of dirty cache lines within a given time interval.

Figure 7.2 Vulnerability of Different Cache Organizations for SPECFP2000

Next, in Section 7.1.1, we model the vulnerability of the L2 caches due to multibit errors using a probabilistic formulation. In Section 7.1.2, we characterize the probabilistic model through extensive simulations for multibit errors in various L2 cache organizations.

7.1.1 Probabilistic Characterization of Multibit Error Rate

As discussed previously, the vulnerability of the L2 cache is given by the expected number of dirty cache lines in a time interval. The expected number of dirty cache lines (represented as $ED$) in a time interval $T$ is obtained from the joint probability that a block $B$ will be written and will not be replaced. This can be represented mathematically as:

\[ ED = N \int_{0}^{T} p\big( Blk(B) \cap Wrt \cap \overline{Evict} \big)\, dt \tag{7.1} \]

where $N$ is the number of blocks in the cache. Let $p_{blk}(B)$ represent the probability that a particular block $B$ is accessed, $p_{Wrt}$ represent the probability that a write occurs at that block, and $p_{Ev}$ represent the probability that the block is evicted during the time period $T$. Assuming that the events are independent, we obtain from the above equation:

\[ ED = N \int_{0}^{T} p_{blk}(B)\, p_{Wrt}\, (1 - p_{Ev})\, dt \tag{7.2} \]

A block $B$ is evicted from the cache if the same set address as that of block $B$ is generated, a tag match does not occur for any of the blocks in the set, and the block $B$ is selected for replacement by the replacement scheme.
Representing this mathematically and again assuming independence, we have:

\[ p_{Ev} = p_{SetAd}(set(B)) \cdot (1 - p_{M}) \cdot p_{blkEv}(B) \tag{7.3} \]

In the above equation, $p_{SetAd}(set(B))$ is the probability that a set address is generated that has the same set as block $B$, $p_{M}$ gives the probability that a tag match succeeds and $p_{blkEv}(B)$ is the probability that the block $B$ in that set is selected for replacement by the replacement algorithm. Based on an LRU replacement policy, for example, $p_{blkEv}(B)$ gives the probability that the oldest block in the set is $B$. Based on the above equations, we can thus characterize the change in cache vulnerability due to changes in cache size. However, characterizing L2 cache vulnerability directly from the probabilistic model due to changes in associativity and cache line size is difficult. Therefore, we performed extensive simulations on SPEC2000 benchmarks to characterize L2 cache vulnerability against changes in cache line size and associativity. Based on this study, we estimated the probabilities for our model.

7.1.2 Vulnerability of Conventional Cache Organizations

In this subsection, we describe the experiments conducted to study the vulnerability due to multibit errors for various L2 cache organizations, which were used to estimate the probabilities of the model described in the previous subsection. Figure 7.1 shows the results for the SPEC2000 integer benchmarks. We varied cache sizes from 16KB to 64KB and 256KB and cache line sizes from 16 bytes to 32 bytes and 64 bytes, assuming direct and set-associative mapping. The vulnerabilities of the 16KB, 64KB, and 256KB caches were found to be 28%, 37%, and 46% on average, respectively. Also, as shown in Figure 7.1(D), changing associativity does not affect the vulnerability much. The 2-way and 4-way caches show slightly lower vulnerability than the direct-mapped cache.
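Once the probabilities of Section 7.1.1 have been estimated, the model can be evaluated numerically. The sketch below assumes, purely for illustration, that all probabilities are constant over the interval so that the integral in equation 7.2 reduces to a product with $T$; the function names and the numeric values are made up and are not measurements from this work.

```python
def eviction_probability(p_set_addr, p_tag_match, p_block_victim):
    """Equation 7.3: p_Ev = p_SetAd(set(B)) * (1 - p_M) * p_blkEv(B)."""
    return p_set_addr * (1.0 - p_tag_match) * p_block_victim

def expected_dirty_lines(n_blocks, interval, p_blk, p_wrt, p_ev):
    """Equation 7.2 under the assumption of time-invariant probabilities,
    so that integrating over [0, T] multiplies the integrand by T."""
    return n_blocks * interval * p_blk * p_wrt * (1.0 - p_ev)

# Made-up probabilities for a hypothetical cache with 1000 blocks.
p_ev = eviction_probability(p_set_addr=0.25, p_tag_match=0.5, p_block_victim=0.5)
ed = expected_dirty_lines(n_blocks=1000, interval=1.0, p_blk=0.5, p_wrt=0.4, p_ev=p_ev)
```

A higher eviction probability lowers the expected number of dirty lines because evicted blocks are written back and no longer hold unprotected data.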
The results for the floating point benchmarks were similar to those of the integer benchmarks, as shown in Figure 7.2. The vulnerability is observed to be 39%, 43%, and 49% for the 16KB, 64KB, and 256KB caches, respectively. The floating-point benchmarks show higher vulnerability in small cache configurations than the integer benchmarks. The above results are used to estimate the probabilities of the model described in the previous subsection.

7.2 Redundancy-based Error Protection

In this section, we present two new schemes that exploit the inherent redundancy existing in the memory hierarchy to reduce the vulnerability of the L2 cache. In Section 7.2.1, we present a scheme to exploit the redundancy existing between the write-through L1 cache and the L2 cache to reduce the vulnerability of the L2 cache. In Section 7.2.2, we describe a scheme to exploit the redundancy between the L2 cache and the main memory to reduce the vulnerability of the L2 cache.

7.2.1 Exploiting L1/L2 Redundancy

The redundancy inherent in the memory hierarchy of high performance processors can be exploited to improve the reliability of the L2 cache against soft errors [92]. Most commercial processors support a write-through L1 cache and a write-back L2 cache. We assumed that the L1 cache supports a no-write-allocate policy and that a merging write buffer exists between the L1 cache and the L2 cache, which prevents bandwidth and power bottlenecks for the write-through L1 cache [86]. As the L1 cache is write-through, the write operations are performed on both the L1 and the L2 cache, thus maintaining redundant copies of the data. Also, there are many cache lines that reside in both the L1 cache and the L2 cache, since they are placed in both of them on L2 cache read misses. We define this implicit redundancy between the L1 and the L2 cache lines as the inclusion property of the L2 cache.

Figure 7.3 Illustrating Inclusion Property and Fine Grain Dirtiness

Soft errors become effective when the data items with errors are replaced from the L2 cache and written into the main memory. If the data items are referenced again from the main memory, the errors will be effective and affect program output. This, however, can be avoided as redundant correct data is present in the L1 cache. Thus, when L2 cache lines are replaced, they have to be checked for soft errors. All multibit errors can be detected using conventional error detecting codes and corrected by fetching non-corrupt data from the L1 cache. In order to support the above scheme, an inclusion bit is maintained with each L2 cache line. On a read operation with an L1 cache miss but an L2 cache hit, the inclusion bit is set to 1 for the corresponding L2 cache block. Also, the L1 cache block that is being replaced due to the miss will cause the corresponding L2 cache block to have no duplicates in the L1 cache. So the inclusion bit of the L2 cache block corresponding to the replaced block from the L1 cache is reset to zero.
On a write operation with a miss in both the L1 cache and the L2 cache, the inclusion bit is reset to zero for the L2 cache block (no-write-allocate policy for L1). The L1 cache line corresponding to the replaced L2 cache block is also invalidated. On a read operation with a miss in both the L1 and L2 cache, the inclusion bit is set to 1 for the new cache line.

7.2.2 Fine Grain Dirtiness

The redundancy between the L2 cache and main memory takes the form of clean L2 cache lines. Errors in clean L2 cache lines can be corrected by refetching them from the main memory, whereas errors in dirty cache lines are not correctable. This, however, assumes that all the data in a dirty cache line are modified. In the standard cache architecture, even when only one word is modified, the dirty bit for the entire cache line containing that word is set to one. Thus, we lose the information that the other words in the cache line are clean. This problem can be alleviated by adding more dirty bits to each cache line. We define this as supporting fine-grain dirtiness in the L2 cache.

Fine-grain dirtiness can be supported, for example, by allocating one dirty bit for each memory word. Only the dirty bit corresponding to the modified memory word is set to one; the other dirty bits are not affected. When an error is detected in a clean L2 cache word during a cache read or a cache line replacement, the error can be corrected by refetching the word from the main memory. Thus, we can correct multibit soft errors in the L2 cache and improve its recoverability. The area overhead of fine-grain dirtiness is small: one dirty bit for each memory word, which is the same overhead as a parity check code. Supporting a dirty bit for each memory word does not increase the complexity of the cache hierarchy. On a read miss in the L2 cache, all dirty bits are reset to zero. The dirty bit corresponding to the modified memory word is set to one on an L2 cache write.
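The dirty-bit bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the simulator implementation: the class name `L2Line`, its methods, and the choice of four words per line are hypothetical.

```python
from dataclasses import dataclass, field

WORDS_PER_LINE = 4  # assumption: four memory words per L2 cache line


@dataclass
class L2Line:
    """Hypothetical model of one L2 line with one dirty bit per word."""
    words: list = field(default_factory=lambda: [0] * WORDS_PER_LINE)
    dirty: list = field(default_factory=lambda: [False] * WORDS_PER_LINE)

    def fill(self, data):
        """Line fill on an L2 read miss: every dirty bit is reset to zero."""
        self.words = list(data)
        self.dirty = [False] * WORDS_PER_LINE

    def write_word(self, index, value):
        """L2 write: only the dirty bit of the modified word is set."""
        self.words[index] = value
        self.dirty[index] = True

    def recoverable_from_memory(self, index):
        """A clean word still has a valid copy in main memory."""
        return not self.dirty[index]


line = L2Line()
line.fill([10, 20, 30, 40])
line.write_word(1, 99)
recoverable = [line.recoverable_from_memory(i) for i in range(WORDS_PER_LINE)]
print(recoverable)
```

With a single dirty bit per line, the write to word 1 would have marked all four words as unrecoverable; with per-word bits, only word 1 loses its memory backup.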
From CACTI simulation [87], the latency and power overhead due to the additional dirty bits is much lower than 1% for a 256KB L2 cache with 32B cache lines.

Figure 7.3 illustrates our memory hierarchy, which utilizes the inclusion property and supports fine-grain dirtiness. Without loss of generality, the L1 and the L2 cache line sizes have been assumed to be the same. Often, the larger L2 cache line size is assumed to be a multiple of the L1 cache line size, as in [5, 30, 31]. In this case, to support the inclusion property, we consider the L2 cache line to be divided into blocks of size equal to the L1 cache line size and provide one inclusion bit for each of these blocks. As illustrated in the figure, a multibit error in the right half of the L2 cache line with inclusion bit 0 and dirty bits 10 can be corrected by refetching the matching data from the main memory, since the right half has not been modified. A multibit error in the L2 cache line with inclusion bit 1 and dirty bits 00 will cause no writeback when the line is replaced, thus effectively correcting the error.

Figure 7.4 Illustrating Reliability-centric Replacement and Small Value Duplication

All L2 cache lines with their inclusion bits set to 1 can be recovered from soft errors by refetching the corresponding L1 cache lines. Since the L1 cache lines are a small percentage of the L2 cache lines, the vulnerability of the L2 cache does not reduce significantly using this scheme alone. Also, correcting a clean cache word by accessing the corresponding memory word can create a performance bottleneck.
Therefore, we suggest more aggressive techniques in the next sections which, combined with the techniques already proposed, will significantly reduce the vulnerability of the L2 cache.

7.3 Improving Reliability by Controlling Redundancy

In this section, we propose two new schemes to mine/control the additional redundancy in the memory hierarchy. In Section 7.3.1, we propose a cache line replacement policy biased towards reliability. The dirty cache blocks which have no duplicates in the memory hierarchy are selected for replacement on a cache miss, thus implicitly increasing redundancy and improving reliability. In Section 7.3.2, we exploit small data values in cache lines to increase redundancy at the word level and hence further improve the reliability of the L2 cache.

7.3.1 Reliability-centric Replacement Policy

Conventional cache line replacement policies aim at improving memory performance by reducing miss rates. They are generally based on the access history of cache lines, such as the recency and frequency of cache line accesses. For example, LRU (Least Recently Used), MFU (Most Frequently Used), and FIFO (First In First Out) use recency or frequency information. The cache line replacement policy can be adapted to improve the reliability of the L2 cache. In addition to recency and frequency information, we can also include the dirtiness of the cache blocks in the process of selecting a victim cache line. If a dirty cache line is chosen as a victim, the number of dirty cache lines in the L2 cache per cycle will reduce and, thus, the vulnerability of the L2 cache will reduce. As blind cache line replacements may affect performance adversely, a hybrid replacement policy has been developed by combining the conventional LRU policy with the dirtiness-based replacement policy. When there is no dirty cache line in the accessed set of the L2 cache, the LRU cache line is replaced.
When the LRU cache line is clean and the next LRU cache line is dirty, the next LRU line is selected as the victim. Only the LRU replacement policy is considered when the number of dirty blocks in the L2 cache is below a vulnerability threshold. The estimated number of dirty cache lines, $ED$, derived from the probabilistic model discussed in Section 4.1, is used to determine the vulnerability threshold, $V_T$, as follows:

$$V_T = K_V \cdot \frac{ED}{N} \qquad (7.4)$$

where $K_V$ is a user-defined constant and $N$ is the total number of blocks in the cache. Thus, the vulnerability threshold depends on the target application workload, which in our case is the SPEC2000 benchmarks, while a user-defined soft-error budget can be specified by controlling $K_V$. Using the probabilistic model, the average number of vulnerable blocks can be estimated from the cache design parameters and can therefore be used to set the vulnerability threshold appropriately. The probabilistic formulation decouples vulnerability, which is a characteristic of the application and the cache architecture, from the soft error rate, which is a characteristic of the environment in which the system is operating. Performance can also be traded for higher reliability of the L2 cache by controlling $K_V$.

Algorithm 5 The Algorithm for L2-cache Access for Multibit Soft Error Protection

    if CACHE HIT then
        if cmd == WRITE then
            if value generated is small then
                set the corresponding small value bit    /* Small Value Detection */
            end if
            if matched block in set_address(addr).dirty_bit == TRUE then
                set_address(addr).written_bit (NMW bit) = TRUE
            end if
            set_address(addr).dirty_bit = TRUE
        end if
    else
        if (No. of dirty blocks / Total No. of blocks) < V_T then
            /* Use LRU replacement */
        else
            Select a block for replacement such that
                set_address(addr).dirty_bit == TRUE and
                set_address(addr).written_bit (NMW bit) == FALSE and
                set_address(addr).inclusion_bit == FALSE
            If other blocks in this set are found with this property,
                write these to the lower level as well    /* Clustered Cleaning */
        end if
    end if
    /* Maintain inclusion property */

The hybrid replacement policy is supported by the addition of one bit per cache line, called No More Write (NMW). The NMW bit exploits the generational behavior of cache lines [88, 89]: cache lines are brought in from the main memory on cache misses, used frequently for a short period of time, and then not used (dead) until they are evicted by another cache miss. The NMW bit in a cache line is maintained as follows. The NMW bit is reset to 0 when a line is brought into the L2 cache. When the cache line is written more than once, its NMW bit is set to 1, indicating that the line is likely to be modified again soon. The NMW bits of L2 cache lines are reset to zero periodically, resembling the popular CLK algorithm used to maintain LRU bits. Thus, the NMW bit acts as a 1-bit predictor of whether the cache line will be written soon. Vulnerable cache lines which are dirty but have their NMW bit at 0 are in their dead write time and can be cleaned and made non-vulnerable. The LRU bits along with the NMW bit are used for selecting the victim cache line, so that the replaced cache lines are close to (or already in) their dead time. Cache lines with their NMW bit set are likely to be written very soon and thus would become vulnerable again if cleaned.
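The threshold of Equation (7.4) and the victim selection of Algorithm 5 can be sketched as follows. This is a minimal illustration, not the simulator code: the names `Line`, `on_write_hit`, and `select_victim` are hypothetical, the cache-wide dirty fraction is passed in as a plain number, and falling back to LRU when no line in the set qualifies is an assumption the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical per-line status bits used by the hybrid policy."""
    lru_age: int             # larger value means less recently used
    dirty: bool = False
    nmw: bool = False        # set once an already dirty line is written again
    inclusion: bool = False  # set when a duplicate exists in the L1 cache


def vulnerability_threshold(k_v, expected_dirty, total_blocks):
    """V_T = K_V * ED / N, following Equation (7.4)."""
    return k_v * expected_dirty / total_blocks


def on_write_hit(line):
    """Write-hit branch of Algorithm 5: a second write sets the NMW bit."""
    if line.dirty:
        line.nmw = True
    line.dirty = True


def select_victim(cache_set, dirty_fraction, v_t):
    """Return (victim, extra lines to clean) for one set on a miss."""
    by_age = sorted(cache_set, key=lambda l: l.lru_age, reverse=True)
    if dirty_fraction < v_t:
        return by_age[0], []                 # below threshold: plain LRU
    candidates = [l for l in by_age
                  if l.dirty and not l.nmw and not l.inclusion]
    if not candidates:
        return by_age[0], []                 # assumption: fall back to LRU
    return candidates[0], candidates[1:]     # rest: clustered cleaning


cache_set = [
    Line(lru_age=3),                              # clean LRU line
    Line(lru_age=2, dirty=True),                  # dirty, predicted dead
    Line(lru_age=1, dirty=True, nmw=True),        # dirty, still being written
    Line(lru_age=0, dirty=True, inclusion=True),  # dirty, duplicated in L1
]
v_t = vulnerability_threshold(k_v=1.0, expected_dirty=64, total_blocks=256)
victim, to_clean = select_victim(cache_set, dirty_fraction=0.75, v_t=v_t)
print(v_t, victim.lru_age, len(to_clean))
```

In this example the clean LRU line is skipped in favor of the oldest dirty line whose NMW and inclusion bits are both 0; the line with NMW set and the line duplicated in L1 are left alone.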
If the prediction is incorrect, i.e., the cache line has not yet reached its dead time but has an NMW bit of 0 (and becomes a candidate for eviction), the cache block will suffer a cache miss, thus causing a performance penalty.

The hybrid replacement policy can be extended to further improve the reliability of the L2 cache by cleaning other dirty cache lines on a replacement. When there are dirty cache lines in the same set as the replaced cache line and they are not expected to be modified for a long time, they can be cleaned together with the victim cache block. This will not increase the cache miss rate but can make the L2 cache more immune to errors by reducing the average number of dirty cache lines per cycle. When an L2 cache line is replaced, the other lines in the same set are also checked for their NMW bits. The cache lines with their NMW bits set to 0 are written back together to the main memory, since they are not likely to be modified soon. If this clustered cleaning of dirty cache lines is accurate, i.e., the lines are not modified for a fairly long time and are then replaced, the vulnerability of the L2 cache will be reduced and there will be no performance penalty.

7.3.2 Exploiting Small Data Value Size

It is commonly known that a large percentage of memory values are small [90, 91, 93]. Small memory values use at most half of the memory word bits. These small memory values can be exploited to increase redundancy and improve the reliability of the L2 cache. A small memory value can be duplicated in the upper half of its memory word, which increases the degree of redundancy in the L2 cache. If the value of the memory word is small, a detected multibit error in the lower half bits can be corrected by using the duplicate data found in the upper half bits.
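A minimal sketch of small value duplication is given below. It assumes 32-bit memory words, and it assumes that a separate error detecting code (not modeled here) reports whether the lower half is corrupted; the function names `encode` and `decode` and the flag `lower_half_corrupt` are hypothetical.

```python
WORD_BITS = 32                 # assumption: 32-bit memory words
HALF = WORD_BITS // 2
LOW_MASK = (1 << HALF) - 1


def encode(value):
    """Return (stored_word, small_value_bit) for a value written to L2."""
    if value >> HALF == 0:                    # zero detector on the upper half
        return (value << HALF) | value, True  # duplicate lower half upward
    return value, False


def decode(stored, small_bit, lower_half_corrupt=False):
    """Undo the duplication; use the upper copy if the lower half is bad."""
    if not small_bit:
        return stored
    if lower_half_corrupt:
        return stored >> HALF                 # recover from the duplicate
    return stored & LOW_MASK                  # upper half reads back as zero


stored, small = encode(0x1234)
corrupted = stored ^ 0x00FF                   # multibit flip in the lower half
recovered = decode(corrupted, small, lower_half_corrupt=True)
print(hex(stored), small, hex(recovered))
```

A value that uses its upper half, such as `0x12345678`, is stored unchanged with the small value bit cleared, so it gains no extra protection from this scheme.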
To implement the duplication of small memory values, each memory word requires a small value bit indicating that the value stored in the word is small and, thus, duplicated in the upper half bits of the memory word. The area overhead due to the duplication is the same as that of a parity bit: one bit for each memory word.

Figure 7.5 Hardware Architecture for Small Value Detection and Duplication

The tasks of detecting, duplicating, and unduplicating small memory values in the L2 cache require hardware overhead. Detecting small memory values can be performed by adding zero detectors that check the upper half bits of a memory word. Duplicating memory values can be done with multiplexers that select between the lower half bits and the upper half bits of the memory value for the upper half bits of the result. Similarly, unduplicating small memory values can be performed with multiplexers that select between zeros and the upper half bits of the memory value for the upper half bits of the result. When the duplicate bit from the L2 cache is 1, zero is selected as the output of the multiplexer. A typical hardware architecture for this scheme is shown in Figure 7.5. The tasks of zero detection, duplication, and unduplication are performed between the L2 cache and the main memory to augment L2 cache line fills and replacements, and between the L1 data cache and the L2 cache to support write-through requests from the write buffer.

An outline of the cache line access/replacement algorithm to control or mine redundancy presented in this section is provided in Algorithm 5. It should be noted that status bits like the additional dirty bits, small value bits, and NMW bits are themselves vulnerable to soft error strikes.
However, we have assumed that such status bits are specially designed using radiation hardened latches [46]. Radiation hardened latches or SRAMs add a slight area overhead compared to regular storage structures. However, as shown in our experimental results section, the overall area overhead of using such schemes in large caches is minimal.

7.4 Experimental Setup and Results

Table 7.1 Description of the Schemes Used in Experiments

    Scheme     Description
    --------   ------------------------------------------------------------
    Baseline   Conventional L2 cache
    I          Exploit inclusion property
    IM         Add multiple dirty bits
               Exploit inclusion property
    D          Replace a dirty cache line with NMW bit 0
    DC         Replace a dirty cache line with NMW bit 0
               Clean dirty cache lines with NMW bit 0 in the same set
    IDCT1      Exploit inclusion property
               Replace a dirty cache line with NMW bit 0
               Clean dirty cache lines with NMW bit 0 in the same set
               Enabled when the L2 cache vulnerability is higher than V_T,
               with V_T = 0.25
    IDCT2      Exploit inclusion property
               Replace a dirty cache line with NMW bit 0
               Clean dirty cache lines with NMW bit 0 in the same set
               Enabled when the L2 cache vulnerability is higher than V_T,
               with V_T = 0.1
    IMDC       Exploit inclusion property
               Add multiple dirty bits
               Replace a dirty cache line with NMW bit 0
               Clean dirty cache lines with NMW bit 0 in the same set
    IMSDC      Exploit inclusion property
               Add multiple dirty bits
               Duplicate small memory values
               Replace a dirty cache line with NMW bit 0
               Clean dirty cache lines with NMW bit 0 in the same set
               Enabled when the L2 cache vulnerability is higher than V_T,
               with V_T = 0.25

In this section, we describe our experimental setup and the results of the various schemes proposed in the paper for improving the reliability of the L2 cache. Table 7.1 summarizes the schemes that we have experimented with in our simulations. In the table, the scheme termed 'Baseline' indicates our baseline processor configuration without any of the proposed schemes.
The scheme termed 'I' exploits the inclusion property. The scheme termed 'IM' employs multiple dirty bits for each cache line along with exploiting the inclusion property. The scheme termed 'D' (Dirty Line First) implements our reliability-centric replacement policy. The scheme termed 'DC' adds clustered cleaning, which cleans dirty cache lines that are not likely to be modified soon, to the reliability-centric replacement policy. The scheme termed 'IDCT1' exploits the inclusion property along with the reliability-centric replacement policy and clustered cleaning, with a vulnerability threshold of 25%. The scheme 'IDCT2' is the same as scheme 'IDCT1' but uses a threshold of 10%. The scheme termed 'IMDC' employs multiple dirty bits for each cache line along with exploiting the inclusion property and supporting the reliability-centric replacement policy with clustered cleaning. The scheme termed 'IMSDC' combines all proposed schemes: multiple dirty bits, detection and duplication of small values, and the inclusion property, along with the reliability-centric replacement policy and clustered cleaning with a vulnerability threshold of 25%.

7.4.1 Experimental Setup

We modified the SimpleScalar version 3 tool suite [94] for this study. Since we target high performance embedded and/or desktop processors, our baseline processor models an out-of-order, four-issue superscalar processor with a split transaction memory bus. Table 7.2 summarizes the simulation parameters of this processor. Since SimpleScalar models a write-back L1 cache, we modified SimpleScalar to support a write-through L1 cache. We also implemented a fully associative merging write buffer with 8 entries between the L1 and L2 caches; each entry of the buffer contains four words. The inclusion property is maintained between the L1 and L2 caches.
When an L2 cache line with its inclusion bit set to one is replaced, the corresponding cache line in the L1 cache is invalidated to maintain the inclusion property. The replacement policy for the L2 cache can be easily extended to implement reliability-centric replacement: we only add an NMW bit to each L2 cache line, and the finite state machine for the replacement function is modified to take into account the dirtiness, the NMW bit, and the inclusion bit of the cache line. If the number of dirty cache lines is larger than the dirtiness threshold, the reliability-centric replacement policy is enabled; the conventional LRU policy is used otherwise. Small values are detected dynamically and tracked using a small value bit. Multiple dirty bits are maintained for each cache line to implement fine-grain dirtiness.

Table 7.2 Baseline Processor Configuration

    Parameter                       Configuration
    -----------------------------   ------------------------------------
    Issue window                    64-entry RUU, 32-entry LSQ
    Decode, issue and commit rate   4 instructions per cycle
    Functional units                4 INT add, 1 INT mult/div,
                                    1 FP add, 1 FP mult/div
    L1 instruction cache            16KB, 4-way, 32B line, 1-cycle
    L1 data cache                   16KB, 4-way, 32B line, 1-cycle
    L2 cache                        unified, 256KB, 4-way, 32B line, 10-cycle
    Main memory                     8B-wide, 100-cycle
    Branch prediction               2-level, 2K BTB, 32-entry RAS
    Instruction TLB                 64-entry, 4-way
    Data TLB                        128-entry, 4-way

Our simulations have been performed with a subset of the SPEC2000 benchmarks [95]. These were compiled with the DEC C V5.9-008, Compaq C++ V6.2-024, and Compaq FORTRAN V5.3-915 compilers using a high optimization level. Eight programs from each of the floating-point and integer benchmark suites were randomly chosen for our evaluation. All the benchmarks were fast-forwarded for one billion instructions to avoid initial startup effects and then simulated for another one billion committed instructions. We also simulated another one billion instructions after fast-forwarding 10 billion instructions. For all simulations, the reference input sets were used.
7.4.2 Simulation Results

We measure the vulnerability of the L2 cache by computing the average number of dirty blocks per cycle without any duplicates in the memory hierarchy. Figures 7.6 and 7.7 present the vulnerability of the L2 cache for the various schemes listed in Table 7.1, including the baseline cache. The vulnerability of the L2 cache for the baseline configuration is 64.6% and 61.4%, on average, for the floating-point and integer benchmarks, respectively. The mesa, gcc, and gzip benchmarks show higher than 90% vulnerability. Scheme 'I' reduces vulnerability to 61.4% and 58%, on average, for the floating-point and integer benchmarks, respectively. These percentages are 53.9%, 43.4%, 41%, and 39.5%, on average, for schemes 'D', 'DC', 'IDCT1', and 'IDCT2', respectively, for the floating-point benchmarks. They are 51.3%, 43.1%, 40.6%, and 38.3% for the integer benchmarks.

Figure 7.6 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECINT2000.

Figure 7.7 Vulnerability of the L2 Cache for Various Schemes Proposed for SPECFP2000.

The maximum benefit from scheme 'I' is limited to 6.25%, since at most 16KB of dirty data can be redundant between the 16KB L1 data cache and the 256KB L2 cache in our baseline processor configuration. Scheme 'D' does not show good results when the baseline vulnerability is high. For example, in mesa, applu, gcc, and gzip, scheme 'D' shows small reductions in vulnerability. This is because, in these benchmarks, most of the cache lines are dirty and, thus, there is little difference between our reliability-based replacement and the LRU policy. In contrast, scheme 'D' works well with ammp: the vulnerability of the L2 cache reduces to 26.2% from 82.9% in the baseline.
Since the L2 cache miss rate is very high (28.8%) and, thus, the IPC is very low (0.1) in ammp, cache lines remain dirty while the pipeline is stalled for a long time due to L2 cache misses, increasing vulnerability per cycle. Scheme 'D' makes those dirty cache lines non-vulnerable by evicting them from the L2 cache, reducing vulnerability per cycle. Scheme 'DC' consistently reduces vulnerability by 10.5% and 8.3%, on average, over and above scheme 'D'. Scheme 'DC' works very well for mesa and parser, for which scheme 'D' was not effective in reducing the vulnerability. Scheme 'IDCT1' removes an additional 2.4% and 2.5% of vulnerability for the floating-point and integer benchmarks, respectively. A vulnerability threshold of 10% further reduces the vulnerability of the L2 cache, by 1% and 2.3% for the floating-point and integer benchmarks, respectively. The ammp, bzip2, and crafty benchmarks benefit most from the 10% threshold.

The fine-grain dirtiness method was implemented with four dirty bits per cache line for a cache line size of four words. Scheme 'IM' reduces the vulnerability of the L2 cache to 43% and 39.6%, on average, for the floating-point and integer benchmarks, respectively. The reductions in vulnerability are 18% and 21%, respectively, compared to 'Baseline'. Scheme 'IM' is comparable to scheme 'IDCT1' in reducing vulnerability, as can be observed in Figures 7.6 and 7.7. In most floating-point benchmarks, scheme 'IM' shows better results than scheme 'IDCT1'; only mesa and galgel show worse results with scheme 'IM'. Half of the integer benchmarks show better results and the other half show worse results with scheme 'IM'.

Figure 7.8 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECINT2000.

Figure 7.9 Global Miss Rates of the L2 Cache for Various Schemes Proposed for SPECFP2000.
Scheme 'IMDC' further reduces the vulnerability to 33.5% and 32.4%, on average, for the floating-point and integer benchmarks, respectively. The applu, mgrid, gzip, and parser benchmarks show large reductions in vulnerability with scheme 'IMDC' compared to scheme 'IM'. As shown in the figures, we have also experimented with our proposed scheme to exploit small memory values. Our combined optimization scheme 'IMSDC' reduces L2 cache vulnerability by 40%, on average, for the integer benchmarks. The floating-point benchmarks show a smaller decrease in vulnerability of about 32%, primarily because floating-point values include sign, exponent, and mantissa fields and hence cannot be detected by the small value detector. As discussed later, with the significant reduction of vulnerability obtained by exploiting/mining redundancy and with the addition of a small direct-mapped ECC cache for the error correction of the vulnerable blocks, an average multibit error coverage of about 96% can be achieved with our approach.

As discussed previously, the NMW bit provides a 1-bit prediction of whether the cache line will be written soon. We also experimented with 2-bit predictors but did not notice any significant changes from the 1-bit predictor case. We do not show these results here for brevity.

We also measured the change in L2 cache miss rate, since our proposed schemes use either the conventional LRU policy or the proposed reliability-centric policy depending on the vulnerability of the L2 cache at a particular time and the chosen vulnerability threshold. Figures 7.8 and 7.9 present the L2 cache miss rate for the various schemes proposed in this paper. We use the global cache miss rate in the figures. Cache miss rates increase by 0.4%, 0.1%, 0.1%, and 0.4%, on average, for schemes 'D', 'DC', 'IDCT1', and 'IDCT2', respectively, for the floating-point benchmarks. These percentages are 12.6%, 10.7%, 10.7%, and 10.7% for the integer benchmarks.
The gcc benchmark shows a decrease in miss rate, which demonstrates that the conventional LRU policy is not optimal for all benchmarks. As shown in the figure, its miss rate reduces significantly when the replacement policy is changed from LRU to a policy favoring the replacement of dirty lines. We note that the LRU policy is based on the approximation that the least recently used block will not be used in the near future. As we also select the dirty cache line in the set which is oldest in terms of LRU, our simulations show that this replacement scheme predicts cache lines in their dead time very accurately and hence has a significantly lower miss rate compared to LRU.

Figure 7.10 IPCs for Various Schemes Proposed for SPECINT2000.

Figures 7.10 and 7.11 plot IPC results for the various schemes proposed in this paper. IPC reductions are 0.2%, 0.2%, 0.3%, and 0.3%, on average, for schemes 'D', 'DC', 'IDCT1', and 'IDCT2', respectively, for the floating-point benchmarks. These percentages are 0.1%, 0.1%, 0.1%, and 0.1% for the integer benchmarks. IPC reduces slightly due to the additional write-back traffic in our schemes. The gcc benchmark shows an IPC increase of 25% for scheme 'D'; its large miss rate reduction in Figure 7.8 translates directly into improved performance. The other benchmarks show slight decreases or increases in IPC. Our proposed scheme, especially 'IDCT1', reduces vulnerability by 23.6%, on average, for the floating-point benchmarks with a 0.3% performance penalty. For the integer benchmarks, vulnerability reduces by 23.1%, on average, with less than 0.1% performance loss.

Since our replacement policies favor dirty cache lines, we also measured the write-back traffic rate from the L2 cache to the main memory, as shown in Figures 7.12 and 7.13.
The write-back traffic rate is measured as the ratio of the number of writes from the L2 cache to all L2 cache accesses. The write-back traffic increases by 1.1%, 191.7%, 2.5%, and 2.8%, on average, for schemes 'D', 'DC', 'IDCT1', and 'IDCT2', respectively, for the floating-point benchmarks. These percentages are 10.8%, 163.3%, 0.9%, and 0.8% for the integer benchmarks. Scheme 'DC' increases memory write traffic significantly, since it performs clustered cleaning of dirty cache lines. In contrast, scheme 'IDCT1' shows little difference in write-back traffic, since it takes inclusion bits into account. Since the cache lines that are redundant between the L1 and L2 caches are the most active cache lines, they are likely to be modified frequently. Cleaning these redundant cache lines in scheme 'DC' does not help reduce the number of vulnerable cache lines. Scheme 'IDCT1' does not clean redundant cache lines, since they are highly likely to be written soon. Schemes 'D' and 'IDCT1' even decrease memory write-back traffic for the integer benchmarks, mainly because of gcc, where the cache miss rate decreases significantly for schemes 'D' and 'IDCT1', which reduces dirty write-backs from the L2 cache. 'IDCT2' shows behavior similar to 'IDCT1'.

Figure 7.11 IPCs for Various Schemes Proposed for SPECFP2000.

As previously discussed, we assumed that a small ECC cache is maintained for error correction of the vulnerable cache blocks, i.e., those dirty cache blocks that have no duplicates in the memory hierarchy. The multibit error correction codes for only the vulnerable blocks are maintained in this small ECC cache. A multibit soft error is always detected by the low-cost error detection codes. If an L2 cache block is non-vulnerable, it is corrected by exploiting the redundancy existing in the memory hierarchy, while vulnerable blocks are corrected using the small ECC cache.
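The choice of correction path for a detected error can be summarized in a short sketch. The status bits are the ones introduced in Sections 7.2 and 7.3, but the function name `recovery_source`, the `Source` enumeration, and in particular the order in which the redundant copies are tried are assumptions made for illustration. The upper-half duplicate is also only usable when the detected error lies in the lower half of the word.

```python
from enum import Enum


class Source(Enum):
    L1_COPY = "refetch from L1 cache"
    MEMORY = "refetch from main memory"
    DUPLICATE = "upper-half duplicate of a small value"
    ECC_CACHE = "multibit code held in the ECC cache"


def recovery_source(inclusion_bit, word_dirty, small_value_bit):
    """Pick where a corrupted L2 word is corrected from (assumed order)."""
    if inclusion_bit:
        return Source.L1_COPY      # duplicate exists in the write-through L1
    if not word_dirty:
        return Source.MEMORY       # clean word: main memory copy is valid
    if small_value_bit:
        return Source.DUPLICATE    # dirty but duplicated inside the word
    return Source.ECC_CACHE        # vulnerable: dirty with no duplicate


choice = recovery_source(inclusion_bit=False, word_dirty=True,
                         small_value_bit=False)
print(choice.value)
```

Only the last case consumes an ECC cache entry, which is why reducing the number of vulnerable blocks allows the ECC cache to stay small.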
The significant reduction in the vulnerability of the L2 cache obtained by exploiting/mining the redundancy in the memory hierarchy allows an ECC cache of significantly smaller size. We found that a direct-mapped ECC cache of size 8KB was sufficient for up to 6-bit error protection using our redundancy-based approach for most SPEC2000 benchmarks on a 256KB L2 cache. Our simulations suggest that an ECC cache of size 4% of that of the L2 cache, together with our redundancy-based approach, can provide a multibit error coverage of about 96% with significantly less area/power overhead and with marginal performance penalty.

Figure 7.12 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECINT2000.

Figure 7.13 Write Back Traffic Rate to the Main Memory for Various Schemes Proposed for SPECFP2000.

We estimated the area overhead of our redundancy-based scheme with a small ECC cache for multibit error protection of the vulnerable blocks. We estimated the area overhead of multibit error correction coding using the following formulation. As the codes obtained by multibit errors from a valid code word must be disjoint from each other for correction to a distinct valid code, we have, for a $p$-bit error correction scheme for an $m$-bit word requiring $r$ check bits,

$$(N + 1)\, 2^{m} \le 2^{m+r} \qquad (7.5)$$

where $N$ is the number of possible ways an error of up to $p$ bits can happen on an $m$-bit word. Since this is the same as the number of ways of choosing $1, 2, \ldots, p$ objects from $m + r$ objects,

$$N = \binom{m+r}{1} + \binom{m+r}{2} + \cdots + \binom{m+r}{p} \qquad (7.6)$$

Figure 7.14 Area Overhead for a L2 Cache with Redundancy Based Error Protection Compared to a Baseline L2 Cache without Error Protection

Solving these equations for $r$ gives us an estimate of the area overhead for complete multibit error protection.
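The bound above can be solved for $r$ numerically. The sketch below searches for the smallest $r$ satisfying $(N + 1)\,2^{m} \le 2^{m+r}$, with $N$ counted over the $m + r$ bits of the code word. The function names are hypothetical, and the result is a lower bound on the number of check bits (a sphere-packing argument), not a guarantee that a code with that many check bits exists.

```python
from math import comb


def error_patterns(n, p):
    """Number of ways 1..p bit errors can occur in an n-bit code word."""
    return sum(comb(n, k) for k in range(1, p + 1))


def min_check_bits(m, p):
    """Smallest r with (N + 1) * 2**m <= 2**(m + r), N taken over m + r bits."""
    r = 0
    while (error_patterns(m + r, p) + 1) * 2**m > 2**(m + r):
        r += 1
    return r


for p in (1, 2):
    print(p, min_check_bits(32, p))
```

For a 32-bit word this gives 6 check bits for single-bit correction, matching the familiar Hamming code, and 10 check bits as the lower bound for 2-bit correction.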
The area overhead of our redundancy-based multibit error correction approach was estimated by considering the area overhead of the small ECC cache and adding the number of status bits (inclusion bit, small value bits, etc.) required for implementing our approach. The area overhead of our redundancy-based multibit error protection for a fixed number of ECC cache blocks is shown in Figure 7.14. As shown in the figure, our redundancy-based protection against 6-bit errors on an L2 cache with a line size of 32B incurs only a marginal area overhead of 6%.

We also estimated the power overhead of our redundancy-based multibit error protection. We ported the SimpleScalar-based framework implementing our approach to the Wattch 2.0 [96] framework. Wattch performs architectural-level power analysis for the cache by maintaining counters for the number of read/write accesses to the cache and multiplying them by the average power required for a single read/write cache access in a particular process technology. We also estimated the leakage power at the architectural level, which is a significant portion of total power in current technology nodes. We used CACTI 4.2 [87], a detailed cache access and power analysis tool, for this analysis. The cache size estimates made previously were provided to CACTI to obtain estimates of leakage power. With a 70nm process technology model, the dynamic power overhead of our redundancy-based multibit error correction scheme for different SPEC2000 benchmarks is plotted in Figure 7.15. As shown in the figure, our scheme increases dynamic power by only about 13.75%.

Figure 7.15 Average Dynamic Power Consumption for a L2 Cache with a 8KB ECC Cache Compared with Baseline L2 Cache without Error Protection
As shown in Figure 7.16, a marginal overhead is also incurred in leakage power for our area efficient multibit error correction scheme for different sized multibit errors. This makes the total power overhead of our approach much smaller than that of most works found in the literature for moderately sized multibit errors.

Figure 7.16 Average Leakage Power Consumption for a L2 Cache with Small ECC Cache with Fixed Number of Blocks for Different Sized Multibit Errors

7.5 Comparison with Related Works

We note that many competing solutions have been proposed in the literature for protecting caches against multibit errors with low area/performance overhead. For comparison of our work with recent works, we have used the data reported in the results of the corresponding research papers and interpolated the results according to our simulation setup. We have assumed an average IPC of 2.5 and that the instruction mix contains 30% memory reference instructions and 10% stores, as is typical of most SPEC benchmarks [95].

In-Cache Replication (ICR) has been proposed in [19] to exploit dead blocks in the data cache to store the replicas of the hot blocks. These duplicate blocks can be used to correct multibit errors in the active blocks. Although an area overhead of less than 1% has been reported with a modest performance penalty of 3.6%, the parity based multibit error protection scheme provides an error coverage of only 65%. Our redundancy based approach has an error coverage of 96% with a performance penalty of less than 1%.

Shadow Caching [18] maintains copies of MFU (Most Frequently Used) cache lines in separate shadow caches. In the context of error correction, at least two shadow caches should be maintained to support correction using majority voting. The approach, although significantly better than blind NMR (N-Modular Redundancy), incurs significant area overhead.
For example, a 4-way associative shadow cache of 128 entries has an area overhead of about 28%. Also, as multiple copies of data are read for error detection, the cache access latency is increased, resulting in a performance overhead of about 40% with a modest error coverage of about 85%. These overheads scale exponentially as higher order multiple-bit errors are considered. In comparison, our redundancy based approach can achieve 96% error coverage with about 6% area overhead and very little performance penalty.

The R-cache approach [21] maintains a small fully associative replication cache to detect and correct multibit errors using copies of dirty data. The method achieves 100% multibit error coverage. However, as multibit error detection is achieved by parallel access of the R-cache and the data cache, a large latency overhead is incurred. For example, with a 2 cycle load latency for reads as reported in the work, a performance penalty of about 7.31% can be incurred. As illustrated in Section 4.4, the performance penalty for our redundancy based multibit error protection scheme is less than 1% with high multibit error coverage.

The Last Store prediction technique [23] proposes the use of an accurate program-counter (PC) trace based predictor which immediately initiates a writeback after observing a PC trace with a sequence of store instructions. However, the hardware structures like the history table and the signature table incur an area overhead of about 8%. Assuming that updating these hardware structures takes at least 1 cycle of latency during stores, a performance penalty of about 15% can be incurred, which is quite high compared to our redundancy based approach. Our approach also achieves a higher error coverage than that reported in their work.
Punctured Error Recovery Cache (PERC) [86] decouples error detection and correction by maintaining the punctured error correction codes in a separate cache. As error detection and correction are separated, little performance overhead is incurred. However, as the number of vulnerable blocks is not actively reduced, complete multibit error coverage requires about 19% area overhead.

Table 7.3 Comparison with Recent Works in Literature

Scheme                       Area overhead   Performance Penalty   Error Coverage
ICR [19]                     <1%             3.6%                  65%
Shadow Cache [18]            28%             40%                   85%
Last Store Prediction [23]   7.98%           15%                   88%
R-cache [21]                 10%             7.31%                 100%
PERC [86]                    19%             <1%                   100%
This work                    6%              <1%                   96%

We note that the techniques proposed in this work, although primarily targeted at single core processors, can be extended and applied to multicore processors. The bandwidth required by the write-through L1 cache used in our approach can be significantly reduced by employing a merging write buffer between the L1 and the L2 caches. For example, when a fully associative merging write buffer with 8 entries, each entry of the size of four words, is placed between the L1 and the L2 cache, a negligible increase of write-through bandwidth is observed. Also, the techniques to mine redundancy using small values and reliability centric replacement can be applied to a cache hierarchy with non-inclusive L1 caches. In multicore systems, the cache coherence protocol between the local L1 caches and a shared L2 cache, which also acts as a synchronization point, can lead to increased exploitation of the inclusion property between the L1 and the L2 cache. A cache read of data in the local L1 cache of a processor that has been modified by another processor in a multicore environment will lead to invalidation requests by the cache controller and refetching of the modified data from the shared L2 cache.
As the inclusion property is naturally enforced between the L1 and the L2 caches, this leads to increased redundancy between local L1 caches and shared L2 caches.

CHAPTER 8 CONCLUSIONS

Aggressive scaling trends have significantly impacted the susceptibility of nanometer designs to transient faults. Transient faults occur due to several reasons, such as soft errors, power supply and interconnect noise, and electromagnetic interference. Soft errors occur when the energetic neutrons coming from space or the alpha particles arising out of packaging materials hit the transistors. A soft error may manifest itself as a bit flip in a latch or memory element. Additionally, soft errors can occur in any internal node of a combinational logic circuit and subsequently propagate to and be captured in a latch. In this dissertation, we have investigated the development of a unified design flow framework for mitigation of soft errors. Several design and circuit optimization techniques applicable at various levels of hardware design have been explored to improve the reliability of nanoscale VLSI systems.

In chapter 3, we presented some preliminaries of soft errors in memory and logic circuits. Different circuit nodes in a logic circuit have different soft error criticality depending on the various masking effects. The masking effects depend on the circuit topology and the underlying cells realizing the logic circuit. Towards this, we developed several metrics that estimate the masking effects in logic circuits. We showed that it is possible to accurately capture the soft error masking effects by using a new metric called the cumulative probability of observability (CPO). The metrics developed in this chapter are extensively used in chapters 5-6 for selectively optimizing circuit nodes and nets.
In chapter 4, we showed that the interconnects realizing the signal nets can act as RC ladders and can effectively filter out glitches due to random radiation strikes. We leveraged the fact that different circuit nodes can be quite different in their ability to mask soft errors based on the circuit topology and the logic cells of the circuit. Based on this, we have developed a simulated annealing (SA) based placement algorithm that places standard cells in a way that provides higher wirelengths for soft error critical nets while simultaneously constraining the chip area and the total wirelength. SA based placement schemes produce good reductions in SER but suffer from large runtimes. Towards this, we proposed a fast quadratic programming based standard cell placer which is orders of magnitude faster than the SA based placement scheme, with some loss in solution quality in terms of SER reduction and its associated delay and area overheads. Experimental results on ISCAS85 benchmark circuits using the FreePDK 45nm technology kit and the OSU standard cells indicate that our radiation immune placement algorithms can reduce the SER in logic circuits significantly with low delay and area overheads.

In chapter 5, we proposed a transistor level circuit which significantly reduces the propagation of random glitches due to radiation strikes. The circuit is based on an RC differentiator implemented in CMOS, which utilizes the exponential voltage spike generated during a radiation strike to detect the occurrence of a single event transient (SET) and disconnects the driving cell from the driven cell. The high voltage drop across the resistor (implemented using NMOS) is provided as the gate-to-body voltage of two depletion mode NMOS and PMOS transistors arranged in series. During the high positive (negative) voltage swing due to the SET, the normally on depletion mode PMOS (NMOS) is cut off, disconnecting the driver from the driven cell.
The circuit incurs some overhead in terms of area and delay. Towards this, we developed an algorithm for selective insertion of these radiation blocker cells on critical circuit nodes. The algorithm ranks circuit nodes using a new metric called the Probability of Radiation-blocker circuit Insertion (PRI) and inserts the radiation blocker cells on the top few nodes in the sorted list of PRI values. The PRI metric is computed as the product of the glitch observability of a node and the slack available at that node. Thus, the algorithm inserts the radiation blocker cells selectively on highly soft error vulnerable nodes on the noncritical paths of a circuit. We experimented with the proposed framework using the FreePDK 45nm Process Design Kit and the Nangate standard cell library based on the 45nm technology. Experimental results indicate that our methodology can reduce the SER in logic circuits by as much as 51% with area overheads of about 18% and a delay overhead of only 0.2%.

In chapter 6, we developed a reliability-centric gate sizing formulation that jointly optimizes the circuit against both radiation induced soft errors and capacitive crosstalk noise under process uncertainty. A first order model is developed for soft errors in logic gates by only considering the effect of the size of a gate and the sizes of the gates in the transitive fanin of the gate. Based on this, we have developed a fast and accurate method for optimizing SER during gate sizing. Crosstalk noise is modeled by clustering the structural netlist based on Rent's exponent values and by equalizing the drive strengths of all cells in a cluster. Timing yield loss due to process variations is optimized by maximizing the delay variance for each gate. These models are incorporated along with delay and power metrics to develop a reliability-centric gate sizing technique based on mathematical programming.
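The PRI-based selection summarized for chapter 5 above can be sketched as a simple ranking. This is only an illustration of the stated rule (PRI as the product of glitch observability and slack, with blocker cells placed on the top few nodes); the function name, the data layout and the example values are assumptions, not the implementation used in this work.

```python
from typing import Dict, List, Tuple


def select_blocker_nodes(
    nodes: Dict[str, Tuple[float, float]], top_k: int
) -> List[str]:
    """Return the top_k node names ranked by PRI = observability * slack.

    `nodes` maps a node name to (glitch_observability, slack); both values
    are assumed to come from earlier analysis steps.
    """
    pri = {name: obs * slack for name, (obs, slack) in nodes.items()}
    # Highest PRI first: vulnerable nodes that also have slack to spare.
    ranked = sorted(pri, key=lambda name: pri[name], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    example = {
        "n1": (0.9, 0.1),  # very observable, but almost no slack
        "n2": (0.5, 0.8),  # moderately observable, plenty of slack
        "n3": (0.2, 0.2),  # neither observable nor much slack
    }
    print(select_blocker_nodes(example, top_k=2))
```

Weighting observability by slack is what steers insertion away from timing-critical paths: a highly observable node with little slack ranks below a moderately observable node that can absorb the extra delay.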
Finally, in chapter 7, we modelled the vulnerability of the L2 caches to multibit errors using a probabilistic formulation characterized by extensive simulations of multibit errors in various L2 cache organizations. Based on this study, we proposed a framework of solutions based on redundancy for the correction of multibit soft errors. In our approach, simple error detection codes like Hamming distance or Cyclic Redundancy Codes (CRC) are used to detect the multiple-bit errors, and they are corrected using the redundancy existing in the memory hierarchy. We demonstrated that multibit errors in the L2 cache can be corrected by exploiting the redundancy existing between the write-through L1 cache and the L2 cache and the redundancy existing between the clean data lines of the L2 cache and the main memory. We found that the bandwidth and power requirement of the write-through L1 cache can be sufficiently reduced by the addition of a small merging write buffer between the L1 and L2 cache. We investigated methods to increase the amount of redundancy in the memory hierarchy by employing a redundancy-based replacement policy, in which the amount of redundancy is controlled by a redundancy threshold estimated using our probabilistic model. Finally, we investigated how redundancy can be mined at the word level by duplicating small memory values in the upper half of the memory word. Multibit errors in the lower half of the word are corrected using the duplicate copy in the upper half. The multibit errors which cannot be corrected using the inherent redundancy are corrected by using a small ECC cache.
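The word-level redundancy described above can be illustrated with a short sketch. It assumes a 32-bit memory word split into two 16-bit halves and a one-bit "small value" status flag kept alongside the word; the word width, the function names and the flag handling are assumptions made for illustration only. Error detection itself is assumed to be done separately by the detection code.

```python
from typing import Optional, Tuple

HALF_BITS = 16                      # assumed half-word width
HALF_MASK = (1 << HALF_BITS) - 1    # mask selecting one half of the word


def encode_word(value: int) -> Tuple[int, bool]:
    """Store a word; small values are duplicated into the upper half.

    Returns (stored_word, small_flag).
    """
    if 0 <= value <= HALF_MASK:
        return (value << HALF_BITS) | value, True
    return value, False


def read_value(stored: int, small_flag: bool) -> int:
    """Normal read path: a small value lives in the lower half."""
    return stored & HALF_MASK if small_flag else stored


def recover_value(stored: int, small_flag: bool) -> Optional[int]:
    """After an error is detected in the lower half, use the duplicate.

    Returns None when the word holds no duplicate, in which case another
    source of redundancy (or the ECC cache) has to supply the correction.
    """
    if small_flag:
        return (stored >> HALF_BITS) & HALF_MASK
    return None


if __name__ == "__main__":
    word, small = encode_word(0x1234)
    corrupted = word ^ 0x00FF       # multibit error confined to the lower half
    print(hex(read_value(corrupted, small)), hex(recover_value(corrupted, small)))
```

Words that do not fit in one half carry no duplicate, which is why the sketch returns `None` for them and leaves the correction to the other mechanisms listed in the summary.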
Thus, in this dissertation, we explored techniques at all levels in the design flow to reduce the vulnerability of VLSI systems against soft errors without compromising on other design metrics like delay, area and power. The design techniques, algorithms and architectures can be integrated into existing design flows, and prototype chips can be implemented on a reliable VLSI system. The implementation can leverage the architectural solutions for the caches, while the custom hardware synthesized for the VLSI system can utilize the various circuit optimization algorithms that are developed at various design abstraction levels.

REFERENCES

[1] D.C. Bossen, J.M. Tendler, and K. Reick. Power4 system design for high reliability. In IEEE Micro, volume 22, pages 16-24, 2002.
[2] N. Quach. High availability and reliability in the Itanium processor. In IEEE Micro, volume 20, pages 61-69, 2000.
[3] R. Phelan. Addressing soft errors in ARM core-based SoC. In ARM White Paper, Dec 2003.
[4] K.C. Yeager. The MIPS R10000 superscalar microprocessor. In IEEE Micro, volume 16, pages 28-41, 1996.
[5] S.S. Mitra, N.M.Z.Q.S. Kee, and S. Kim. Robust system design with built-in soft-error resilience. In IEEE Computer, volume 38, pages 43-52, 2005.
[6] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. In Intel Technology Journal, volume 1, 2001.
[7] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. In Trans. on Nuclear Science, volume 47, pages 2586-2594, 2000.
[8] T. Karnik, B. Bloechel, K. Soumyanath, V. De, and S. Borkar. Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18µ. In Digest of Symp. on VLSI Circuits, pages 61-62, 2001.
[9] N. Seifert, D. Moyer, N. Leland, and R. Hokinson. Historical trend in alpha-particle induced soft error rates of the Alpha microprocessor. In Intl.
Reliability Physics Symposium, pages 259-265, 2001.
[10] C. Constantinescu. Trends and challenges in VLSI circuit reliability. In IEEE Micro, volume 23, pages 14-19, 2003.
[11] N. Jha and S. Kundu. Testing and Reliable Design of CMOS Circuits. Kluwer Academic Publishers, 1990.
[12] H. Asadi, V. Sridharan, M.B. Tahoori, and D. Kaeli. Balancing performance and reliability in the memory hierarchy. In Proc. of Symp. on Performance Analysis of Systems and Software, pages 269-279, 2005.
[13] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S.S. Mukherjee, and R. Rangan. Computing architectural vulnerability factors for address-based structures. In Proc. of ISCA, pages 532-543, 2005.
[14] V. Narayanan and Y. Xie. Reliability concerns in embedded system designs. In IEEE Computer, volume 39, pages 118-120, 2006.
[15] C.W. Slayman. Cache and memory error detection, correction, and reduction techniques for terrestrial servers and workstations. In Trans. on Device and Materials Reliability, volume 5, pages 397-404, 2005.
[16] S.S. Mukherjee, J. Emer, T. Fossum, and S.K. Reinhardt. Cache scrubbing in microprocessors: myth or necessity? In Proc. of Intl. Symp. on Dependable Computing, pages 37-42, 2004.
[17] W. Zhang, M. Kandemir, A. Sivasubramaniam, and M.J. Irwin. Performance, energy and reliability tradeoffs in replicating hot cache lines. In Proc. of the Intl. Conf. on Compilers, Architectures and Synthesis for Embedded Systems, pages 309-317, 2003.
[18] S. Kim and A.K. Somani. Area efficient architectures for information integrity in cache memories. In Proc. of the ISCA, 1999.
[19] W. Zhang, S. Gurumurthi, M. Kandemir, and A. Sivasubramaniam. ICR: in-cache replication for enhancing data cache reliability. In Proc. of Intl. Conf. on Dependable Systems and Networks, pages 291-300, 2003.
[20] T. Tanzawa, T. Tanaka, K. Takeuchi, R. Shirota, S. Aritome, H. Watanabe, G. Hemink, K. Shimizu, S. Sato, Y. Takeuchi, et al. A compact on-chip ECC for low cost flash memories.
In Journal of Solid-State Circuits, volume 32, pages 662-669, 1997.
[21] W. Zhang. Enhancing data cache reliability by the addition of a small fully-associative replication cache. In Proc. of Intl. Conf. of Supercomputing, pages 12-19, 2004.
[22] V. Sridharan, H. Asadi, M.B. Tahoori, and D. Kaeli. Reducing data cache susceptibility to soft errors. In Trans. on Dependable and Secure Computing, volume 3, pages 353-364, 2006.
[23] B.T. Gold, M. Ferdman, B. Falsafi, and K. Mai. Mitigating multi-bit soft errors in L1 caches using last-store prediction. In Proc. of Federated Computing Research Conf., 2007.
[24] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. Hoe. Multi-bit error tolerant caches using two-dimensional error coding. In IEEE Micro, 2007.
[25] K. Bhattacharya, S. Kim, and N. Ranganathan. Improving the reliability of on-chip L2 cache using redundancy. In Proc. of the ICCD, pages 224-229, 2007.
[26] Y. Dhillon, A. Diril, and A. Chatterjee. Soft-error tolerance analysis and optimization of nanometer circuits. Proc. of DATE, pages 288-293, 2005.
[27] G. Messenger. Collection of charge on junction nodes from ion tracks. Trans. of Nuclear Science, 29(6):2024-2031, 1982.
[28] S. Mitra, T. Karnik, N. Seifert, and M. Zhang. Logic soft errors in sub-65nm technologies design and CAD challenges. Proc. of DAC, pages 2-4, 2005.
[29] R. Rajaraman, J. Kim, N. Vijaykrishnan, Y. Xie, and M. Irwin. SEAT-LA: A soft error analysis tool for combinational logic. Proc. of VLSID, pages 499-502, 2006.
[30] R. Rao, D. Blaauw, and D. Sylvester. Soft error reduction in combinational logic using gate resizing and flip-flop selection. Proc. of ICCAD, pages 502-509, 2006.
[31] FreePDK 45nm Technology Kit. http://www.eda.ncsu.edu/wiki/FreePDK.
[32] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. Proc. of DSN, pages 389-398, 2002.
[33] B. Zhang, W. Wang, and M.
Orshansky. FASER: Fast analysis of soft error susceptibility for cell-based designs. Time, 1(66):210, 2003.
[34] Nangate Standard Cell library. http://www.si2.org/openeda.si2.org/projects/nangatelib
[35] M. Nicolaidis. Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies. Trans. on VTS, 99:86-94, 1999.
[36] T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla, and S. Borkar. Selective node engineering for chip-level soft error rate improvement. Proc. of Symp. on VLSI Circuits, pages 204-205, 2002.
[37] J. Kumar and M. Tahoori. Use of pass transistor logic to minimize the impact of soft errors in combinational circuits. Proc. of Workshop on SELSE, 2005.
[38] Y. Sasaki, K. Namba, and H. Ito. Circuit and Latch Capable of Masking Soft Errors with Schmitt Trigger. Journal of Electronic Testing, 24(1), pages 11-19, 2008.
[39] R. Garg, N. Jayakumar, S.P. Khatri, and G. Choi. A design approach for radiation-hard digital electronics. Proc. of the DAC, pages 773-778, 2006.
[40] K. Bhattacharya and N. Ranganathan. RADJAM: A Novel Approach for Reduction of Soft Errors in Logic Circuits. Proc. of the VLSI Design, pages 453-458, 2009.
[41] Y. Sasaki, K. Namba, and H. Ito. Soft Error Masking Circuit and Latch Using Schmitt Trigger Circuit. Proc. of the Symp. on DFT, pages 327-335, 2006.
[42] K. Bhattacharya and N. Ranganathan. Reliability-centric Gate Sizing with Simultaneous Optimization of Soft Error Rate, Delay and Power. Proc. of the ISLPED, pages 99-104, 2008.
[43] N. Hanchate and N. Ranganathan. LECTOR: A Technique for Leakage Reduction in CMOS Circuits. IEEE Trans. on VLSI Systems, 12(2), pages 196-205, 2004.
[44] K. Roy and S. Prasad. Low Power CMOS VLSI: Circuit Design. Wiley-Interscience, 2000.
[45] J. Cazeaux, D. Rossi, M. Omaña, A. Chatterjee, and C. Metra. On Transistor Level Gate Sizing for IC Design Robust To Transient Faults. Proc. of Intl. On-Line Testing Symposium, 2005.
[46] C. Nagpal, R. Garg, and S. Khatri.
A delay-efficient radiation-hard digital design approach using CWSP elements. Proc. of DATE, pages 354-359, 2008.
[47] S. Mitra, M. Zhang, S. Waqas, N. Seifert, B. Gill, and K. Kim. Combinational logic soft error correction. Proc. of ITC, pages 824-832, 2006.
[48] M. Choudhury, Q. Zhou, and K. Mohanram. Design optimization for single-event upset robustness using simultaneous dual-VDD and sizing techniques. Proc. of ICCAD, pages 204-209, 2006.
[49] N. Sherwani. Algorithms for VLSI Physical Design Automation. Kluwer Academic Publishers, Boston, 1995.
[50] V. Mahalingam and N. Ranganathan. Variation Aware Timing Based Placement Using Fuzzy Programming. Proc. of ISQED, pages 327-332, 2007.
[51] C. Li, M. Xie, C. Koh, J. Cong, and P. Madden. Routability-Driven Placement and White Space Allocation. IEEE Trans. on CAD, 26(5), pages 858-871, 2007.
[52] K. Bhattacharya and N. Ranganathan. A New Placement Algorithm for Reduction of Soft Errors in Macro Cell based Design of Nanometer Circuits. Proc. of ISVLSI, pages 91-96, 2009.
[53] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani. VLSI module placement based on rectangle-packing by the sequence pair. IEEE Trans. on CAD, 15(12), pages 1518-1524, 1996.
[54] X. Tang, R. Tian, and D. Wong. Fast Evaluation of Sequence Pair in Block Placement by Longest Common Subsequence Computation. Proc. of the DATE, pages 106-111, 2000.
[55] P. Fernando and S. Katkoori. An Elitist Non-Dominated Sorting Based Genetic Algorithm for Simultaneous Area and Wirelength Minimization in VLSI Floorplanning. Proc. of VLSI Design, pages 337-342, 2008.
[56] C. Zhao, S. Dey, and X. Bai. Soft-Spot Analysis: Targeting Compound Noise Effects in Nanometer Circuits. Trans. on Design and Test, pages 362-375, 2005.
[57] J. Lou and W. Chen. Cross talk driven placement. Proc. of ASPDAC, pages 735-740, 2003.
[58] J. Stine, J. Grad, I. Castellanos, J. Blank, V. Dave, M. Prakash, N. Iliev, and N.
Jachimiec. A Framework for High-Level Synthesis of System-on-Chip Designs. Proc. of MSE, pages 11-12, 2005.
[59] C. Zhao, X. Bai, and S. Dey. A scalable soft spot analysis methodology for compound noise effects in nanometer circuits. Proc. of DAC, pages 894-899, 2004.
[60] V. Jain and P. Zarkesh-Ha. Analytical Noise-Rejection Model Based on Short Channel MOSFET. Proc. of ISQED, pages 401-406, 2008.
[61] I. Parulkar, A. Wood, J. Hoe, B. Falsafi, S. Adve, J. Torrellas, and S. Mitra. OpenSPARC: An Open Platform for Hardware Reliability Experimentation. Proc. of SELSE, 2008.
[62] S. Adya and I. Markov. Fixed-outline Floorplanning Through Better Local Search. Proc. of ICCD, pages 328-333, 2001.
[63] J. Kleinhans, G. Sigl, F. Johannes, and K. Antreich. GORDIAN: VLSI placement by quadratic programming and slicing optimization. IEEE Trans. on CAD of Integrated Circuits and Systems, 10(3), pages 356-365, 1991.
[64] M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, M. Booth, and F. Rossi. GNU scientific library. Network Theory Ltd., 2002.
[65] L. Gaspero. QuadProg++. http://www.diegm.uniud.it/digaspero/index.php?page=software anddata.
[66] J. Roy, D. Papa, S. Adya, H. Chan, A. Ng, J. Lu, and I. Markov. Capo: robust and scalable open-source min-cut floorplacer. Proc. of ISPD, pages 224-226, 2005.
[67] M. Berkelaar and J. Jess. Gate sizing in MOS digital circuits with linear programming. Proc. of EDAC, pages 217-221, 1990.
[68] J.P. Fishburn and A.E. Dunlop. TILOS: A posynomial programming approach to transistor sizing. IEEE Trans. on CAD, pages 326-336, 1985.
[69] N. Hanchate and N. Ranganathan. Simultaneous interconnect delay and crosstalk noise optimization through gate sizing using game theory. IEEE Trans. on Computers, 55(8):1011-1023, 2006.
[70] http://courses.ece.uiuc.edu/ece543/iscas85.html. ISCAS'85 benchmark circuits.
[71] http://www-neos.mcs.anl.gov/neos/solvers/cp:KNITRO/AMPL.html. KNITRO solver from NEOS server.
[72] K. Chopra, S.
Shah, A. Srivastava, D. Blaauw, and D. Sylvester. Parametric yield maximization using gate sizing based on efficient statistical power and delay gradient computation. Proc. of ICCAD, pages 1023-1028, 2005.
[73] J. Liou, K. Cheng, S. Kundu, and A. Krstic. Fast statistical timing analysis by probabilistic event propagation. Proc. of DAC, pages 661-666, 2001.
[74] M. Hashimoto and H. Onodera. A Performance Optimization Method by Gate Sizing using Statistical Static Timing Analysis. Proc. of ISPD, pages 111-116, 2000.
[75] L. Macchiarulo, E. Macii, and M. Poncino. Wire Placement for Crosstalk Energy Minimization in Address Buses. Proc. of DATE, pages 158-162, 2002.
[76] M. Mani, A. Devgan, and M. Orshansky. An Efficient Algorithm for Statistical Minimization of Total Power under Timing Yield Constraints. Proc. of DAC, pages 309-314, 2005.
[77] M. Mani and M. Orshansky. A New Statistical Optimization Algorithm for Gate Sizing. Proc. of ICCD, pages 272-277, 2004.
[78] A. Murugavel and N. Ranganathan. Gate Sizing and Buffer Insertion using Economic models for Power Optimization. Proc. of VLSI Design, pages 195-200, 2004.
[79] K. Bhattacharya and N. Ranganathan. A linear programming formulation for security-aware gate sizing. Proc. of GLSVLSI, pages 273-278, 2008.
[80] S. Bhardwaj, Y. Cao, and S. Vrudhula. Statistical leakage minimization through joint selection of gate sizes, gate lengths and threshold voltage. Proc. of ASPDAC, 2006.
[81] S. Sapatnekar, V. Rao, and P. Vaidya. An exact solution to the transistor sizing problem for CMOS circuits using convex optimization. IEEE Trans. on CAD, 12(11):1621-1634, 1993.
[82] N. Weste, D. Harris, and A. Banerjee. CMOS VLSI Design: A circuits and systems perspective. Pearson/Addison-Wesley, 2005.
[83] X. Bai, C. Visweswariah, P. Strenski, and D. Hathaway. Uncertainty-Aware Circuit Optimization. Proc. of DAC, pages 58-63, 2002.
[84] Q. Zhou and K. Mohanram.
Gate sizing to radiation harden combinational logic. IEEE Trans. on CAD, 25(1):155-166, 2006.
[85] K. Mohanram and N. Touba. Cost-Effective Approach for Reducing Soft Error Failure Rate in Logic Circuits. Proc. of ITC, pages 893-901, 2003.
[86] N. Sadler and D. Sorin. Choosing an error protection scheme for a microprocessor's L1 data cache. In Proc. of the ICCD, 2006.
[87] G. Reinman and N. Jouppi. An integrated cache timing and power model. In Compaq WRL Report, 1999.
[88] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proc. of the ISCA, pages 240-251, 2001.
[89] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi. Speculative versioning cache. In Proc. of the ISCA, pages 195-205, 1998.
[90] D. Brooks and M. Martonosi. Dynamically exploiting narrow width operands to improve processor power and performance. In Proc. of the HPCA, 1999.
[91] J. Hu, S. Wang, and S. Ziavras. In-register duplication: Exploiting narrow-width value for improving register file reliability. In Proc. of the Intl. Conf. on Dependable Systems and Networks, volume 0, pages 281-290, 2006.
[92] M. Kadiyala and L. Bhuyan. A dynamic cache sub-block design to reduce false sharing. In Proc. of ICCD, 1995.
[93] H. Lee, G. Tyson, and M. Farrens. Eager writeback - a technique for improving bandwidth utilization. In Proc. of the Symp. on Microarchitecture, pages 11-21, 2000.
[94] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. In ACM SIGARCH Computer Architecture News, volume 25, pages 13-25, 1997.
[95] SPEC2000 benchmarks. In http://www.specbench.org/osg/cpu2000/
[96] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proc. of ISCA, pages 83-94, 2000.
[97] D. Sinha and H. Zhou. Gate sizing for crosstalk reduction under timing constraints by Lagrangian relaxation. Proc. of ICCAD, pages 14-19, 2004.
[98] D.
Sinha and H. Zhou. Yield driven gate sizing for coupling-noise reduction under uncertainty. Proc. of ASPDAC, pages 192-197, 2005.
[99] T. Xiao and M. Marek-Sadowska. Crosstalk reduction by transistor sizing. Proc. of ASPDAC, pages 137-140, 1999.
[100] K. Nepal, R. Bahar, J. Mundy, W. Patterson, and A. Zaslavsky. MRF Reinforcer: A Probabilistic Element for Space Redundancy in Nanoscale Circuits. Proc. of IEEE MICRO, pages 19-27, 2006.
[101] V. Mahalingam, N. Ranganathan, and J. Harlow III. A Fuzzy Optimization Approach for Variation Aware Power Minimization During Gate Sizing. IEEE Trans. on VLSI, 16(8):975-984, 2008.
[102] P. Verplaetse, J. Dambre, D. Stroobandt, and J. Campenhout. On partitioning vs. placement rent properties. Proc. of SLIP, pages 33-40, 2001.
[103] K. Bhattacharya and N. Ranganathan. Reliability-centric gate sizing with simultaneous optimization of soft error rate, delay and power. Proc. of ISLPED, pages 99-104, 2008.
[104] R. Bahar. Nanoscale Circuits and Architectures for Probabilistic Computation in the Presence of Noise. Proc. of FNANO, Invited paper, 2006.
[105] International Technology Roadmap for Semiconductors. http://www.itrs.net/Links/2001ITRS/Home.htm.
[106] J. Singh, V. Nookala, Z. Luo, and S. Sapatnekar. Robust gate sizing by geometric programming. Proc. of DAC, pages 315-320, 2005.
[107] N. Hanchate and N. Ranganathan. Statistical Gate Sizing for Yield Enhancement at Post Layout Level. Proc. of ISVLSI, pages 245-252, 2007.
[108] N. Ranganathan, U. Gupta, and V. Mahalingam. Simultaneous optimization of total power, crosstalk noise, and delay under uncertainty. Proc. of GLSVLSI, pages 171-176, 2008.
[109] S. Wang, J. Hu, and S. Ziavras. Self-Adaptive Data Caches for Soft-Error Reliability. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 27(8), pages 1503-1507, 2008.
[110] J. Hu, F. Li, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. Irwin.
Compiler Assisted Soft Error Detection under Performance and Energy Constraints in Embedded Systems. ACM Transactions on Embedded Computing Systems, 8(4), pages 1-30, 2009.
[111] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. Irwin. Soft error and energy consumption interactions: a data cache perspective. Proc. of the ISLPED, pages 132-137, 2004.

LIST OF PUBLICATIONS

K. Bhattacharya, N. Ranganathan and S. Kim, A Framework for Correction of Multibit Soft Errors in L2 Caches Based on Redundancy, IEEE Trans. on VLSI Systems, 17(2), pp. 194-206, 2009.
V. Mahalingam, K. Bhattacharya, N. Ranganathan, H. Chakravarthula, R. Murphy and K. Pratt, A VLSI Architecture and Algorithm for Lucas-Kanade Based Optical Flow Computation, IEEE Trans. on VLSI Systems, to appear, 2009.
K. Bhattacharya and N. Ranganathan, A New Placement Algorithm for Reduction of Soft Errors in Macro Cell based Design of Nanometer Circuits, Proc. of Annual Symp. on VLSI (ISVLSI), pp. 91-96, 2009.
K. Bhattacharya and N. Ranganathan, RADJAM: A Novel Approach for Reduction of Soft Errors in Logic Circuits, Proc. of the 22nd Intl. Conf. VLSI Design (VLSID), pp. 453-458, 2009.
K. Bhattacharya and N. Ranganathan, A Unified Gate Sizing Formulation for Optimizing Soft Error Rate, Crosstalk Noise and Power under Process Variations, Proc. of the 10th Intl. Symp. on Quality Electronic Design (ISQED), pp. 388-393, 2009.
K. Bhattacharya, M. Venkataraman and N. Ranganathan, A VLSI System Architecture for Optical Flow Computation, Proc. of the 42nd Intl. Symp. on Circuits and Systems (ISCAS), pp. 357-360, 2009.
R. Hyman Jr., K. Bhattacharya and N. Ranganathan, A Strategy for Soft Error Reduction in Multicore Designs, Proc. of the 42nd Intl. Symp. on Circuits and Systems (ISCAS), pp. 2217-2220, 2009.
N. Ranganathan and K. Bhattacharya, Methodology and Apparatus for Reduction of Soft Errors in Logic Circuits, Provisional Patent Application filed on June 13, 2008.
K.
Bhattacharya and N. Ranganathan, A Linear Programming Formulation for Security-Aware Gate Sizing, Proc. of the 19th Great Lakes Annual Symp. on VLSI Design (GLSVLSI), pp. 273-278, 2008. (Nominated for Best Paper Award: Ranked among top 6 out of 220 submissions and 40 accepted papers).
K. Bhattacharya and N. Ranganathan, Reliability-centric Gate Sizing with Simultaneous Optimization of Soft Error Rate, Delay and Power, Proc. of the 13th Intl. Symp. on Low Power Electronics and Design (ISLPED), pp. 99-104, 2008.
K. Bhattacharya, S. Kim, and N. Ranganathan, Improving the Reliability of On-chip L2 cache Using Redundancy, Proc. of the 25th Intl. Conf. on Computer Design (ICCD), pp. 224-229, 2007.
K. Bhattacharya and N. Ranganathan, A Novel Radiation Blocker Circuit and its Selective Insertion for Soft Error Mitigation, IEEE Trans. on VLSI Systems (2nd Review).
K. Bhattacharya and N. Ranganathan, Placement for Radiation Immunity in Cell Based Design of Nanometer Circuits, IEEE Trans. on VLSI Systems (2nd Review).

ABOUT THE AUTHOR

Koustav Bhattacharya received his B.Tech degree in Computer Engineering from Kalyani University, West Bengal, India in 2002 and his Master's degree in Computer Technology from the Indian Institute of Technology, Delhi, India in 2004. In 2004, he worked as a design engineer in ST Microelectronics, Noida, India. He was awarded the Richard E. Merwin Scholarship in 2007. His research interests include Design Automation, VLSI Design and Test, Computer Architecture, Design for Reliability, Design for Manufacturability and FPGA Design.