Record Information

Title: Low power technology mapping and performance driven placement for field programmable gate arrays [electronic resource] / by Hao Li.
Author: Li, Hao, 1972 May 13
Publication: [Tampa, Fla.] : University of South Florida, 2004.
Thesis: Thesis (Ph.D.)--University of South Florida, 2004.
Adviser: Katkoori, Srinivas.
Notes: Includes bibliographical references. Includes vita. Text (Electronic thesis) in PDF format. Title from PDF of title page. Document formatted into pages; contains 134 pages.
System requirements: World Wide Web browser and PDF reader. Mode of access: World Wide Web.
Subjects: logic synthesis; power minimization; FPGA; high level synthesis; physical synthesis and design; network flow.
Genre: Dissertations, Academic, USF, Computer Science and Engineering, Doctoral.
Collection: USF Electronic Theses and Dissertations.
Identifiers: 001498127; E14SFE0000523; (OCoLC)57715421; AJU6722; SFE0000523; TK7885 (ONLINE)
URL: http://digital.lib.usf.edu/?e14.523

Full Text

Low Power Technology Mapping and Performance Driven Placement for Field Programmable Gate Arrays

by

Hao Li

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Srinivas Katkoori, Ph.D.
N. Ranganathan, Ph.D.
Miguel Labrador, Ph.D.
Wilfrido Moreno, Ph.D.
Stephen Suen, Ph.D.

Date of Approval: November 9, 2004

Keywords: physical synthesis and design, network flow, high level synthesis, logic synthesis, power minimization, FPGA

© Copyright 2004, Hao Li

DEDICATION

To my dearest grandparents, my parents, my wife, and those who have helped and encouraged me in my life. In memory of my father-in-law, Li Binsheng.

ACKNOWLEDGEMENTS

I would like to express my gratitude to my major professor, Dr. Srinivas Katkoori, for his guidance, support and encouragement throughout my doctoral degree program. I also sincerely thank Dr. Wai-Kei Mak, my previous co-major professor at the University of South Florida, for his kindness, help and instruction. They led me into the seemingly arduous but actually exciting research world, and they made me believe that I am Ph.D. material. Special thanks to Dr. Ranganathan, Dr. Labrador, Dr. Moreno, and Dr. Suen for being on my Ph.D. committee and providing valuable advice. I also want to thank Dr. Carnahan for chairing my defense. I would also like to thank the faculty and staff of the CSE department at USF for giving me enormous help. I thank all the members of the VCAPP group who have taught me a lot of things, to name just a few: Suvodeep, Saraju, Mouli, Stelian, Ashok, Sanjukta. I would also like to thank my friends Tong, Guitao, Yanfei, Zhibin, and Dongqing for their help and friendship.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 VLSI Design and Computer-Aided Design
    1.1.1 Typical VLSI Design Cycle
    1.1.2 New Trends in VLSI Design
  1.2 Overview of FPGA-based Designs
    1.2.1 FPGA Interconnect Architecture
    1.2.2 FPGA Logic Block Architecture
  1.3 Low Power Design
    1.3.1 Static Power Dissipation
    1.3.2 Dynamic Power Dissipation
    1.3.3 Total Power Dissipation
  1.4 High Level Synthesis
    1.4.1 Basic Concept of High Level Synthesis
    1.4.2 Motivation of High Level Synthesis
    1.4.3 Phases of High Level Synthesis
  1.5 Dissertation Contributions
  1.6 Dissertation Overview
CHAPTER 2 BACKGROUND AND RELATED WORK
  2.1 FPGA Architecture
  2.2 CAD Flow for FPGA Physical Design
    2.2.1 Design Entry
    2.2.2 Initial Synthesis
    2.2.3 Functional Simulation
    2.2.4 Logic Synthesis and Optimization
    2.2.5 Physical Design
      2.2.5.1 Partitioning
      2.2.5.2 Floorplanning and Placement
      2.2.5.3 Routing
      2.2.5.4 Compaction
      2.2.5.5 Extraction and Verification
    2.2.6 Timing Simulation
  2.3 Technology Mapping for LUT-based FPGAs
    2.3.1 Technology Mapping for Delay Optimization
    2.3.2 Technology Mapping for Area Minimization
    2.3.3 Technology Mapping for Routability and Low Power
  2.4 FPGA Placement
    2.4.1 Optimization Objectives of Placement
      2.4.1.1 Estimation of Wirelength
      2.4.1.2 Minimize Total Wirelength
      2.4.1.3 Minimize Maximum Density
    2.4.2 Placement Approaches
      2.4.2.1 Partitioning-based Placement Algorithms
      2.4.2.2 Analytic-based Placement Algorithms
      2.4.2.3 Simulated Annealing Placement Algorithms
      2.4.2.4 Summary of Different Placement Algorithms
  2.5 Conclusion
CHAPTER 3 LOW POWER TECHNOLOGY MAPPING FOR LUT-BASED FPGAS
  3.1 Introduction
  3.2 Problem Formulation
  3.3 Power Estimation Model
  3.4 Power Minimization Algorithm
    3.4.1 Phase II: Computation of EP(v)
      3.4.1.1 T-Bounded K-Feasible Cut
      3.4.1.2 Incremental Network Flow Computation
    3.4.2 Phase III: Mapping Generation
    3.4.3 Computational Complexity of PowerMinMap
  3.5 PowerMinMapd: Simultaneous Power and Delay Optimization
    3.5.1 Review of Flowmap Algorithm
    3.5.2 PowerMinMapd Algorithm
    3.5.3 Computational Complexity of PowerMinMapd
  3.6 Experimental Results of Low Power Technology Mapping Algorithms
  3.7 Conclusions on Low Power Technology Mapping
CHAPTER 4 PERFORMANCE-DRIVEN FORCE-DIRECTED PLACEMENT ALGORITHM FOR HIERARCHICAL FPGAS
  4.1 Introduction
  4.2 Hierarchical Xilinx FPGA Architecture
  4.3 Proposed Placement Algorithm
    4.3.1 Overview of the Algorithm
    4.3.2 Net Cluster Floorplanning
    4.3.3 Coarse Net-level Placement
      4.3.3.1 Attractive and Repulsive Forces
      4.3.3.2 Net Placement
    4.3.4 Logic Cell Placement
    4.3.5 I/O Pin Matching
    4.3.6 Summary of the Proposed Algorithm
  4.4 Experimental Results
    4.4.1 Comparison with Xilinx Foundation Tool
    4.4.2 Comparison with VPR
  4.5 Conclusions and Future Work
CHAPTER 5 HIGH LEVEL SYNTHESIS FOR PERFORMANCE DRIVEN PLACEMENT
  5.1 Automatic Design Instantiation System (AUDI)
  5.2 Performance Driven Placement with High Level Synthesis
    5.2.1 Overview of the Proposed Design Flow
    5.2.2 Estimation of the Design Performance
    5.2.3 Iterative Design Space Search
  5.3 Experimental Results
  5.4 Conclusions and Future Work
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
REFERENCES
ABOUT THE AUTHOR

LIST OF TABLES

Table 1.1 Transistors on an Intel Processor Over the Years.
Table 3.1 Comparison of PowerMinMap with [1] and [2] (PWR: mW).
Table 3.2 Comparison of PowerMinMapd and Cutmap (Power: mW).
Table 3.3 Comparison of PMMd and Cutmap with Randomly Generated Transition Densities for PI Nodes.
Table 4.1 List of Constants Used in Our Work.
Table 4.2 Characteristics of Combinational Circuits.
Table 4.3 Comparison with Xilinx Foundation for Combinational Circuits.
Table 4.4 Characteristics of Sequential Circuits.
Table 4.5 Comparison with Xilinx Foundation for Sequential Circuits.
Table 4.6 Comparison with VPR.
Table 5.1 Description of Behavioral Benchmarks for AUDI System.
Table 5.2 High Level Synthesis for Performance Driven Placement.
Table 5.3 Delay Estimation for "latt".

LIST OF FIGURES

Figure 1.1 Moore's Law on Intel Processor Series.
Figure 1.2 Typical VLSI Design Cycle.
Figure 1.3 Typical Flow of High Level Synthesis.
Figure 1.4 An Example Data Flow Graph and a Schedule.
Figure 2.1 (a) An Example of 2-input LUT. (b) A Fine-grained Logic Block with a Flip-flop.
Figure 2.2 Multiplexer-based Logic Module Used by Actel.
Figure 2.3 Architecture of an Array-based FPGA.
Figure 2.4 Typical Design Flow of a CAD System.
Figure 2.5 Logic Synthesis Design Flow.
Figure 2.6 Physical Design Flow.
Figure 2.7 Global and Detailed Routing.
Figure 2.8 Various LUT Mappings for a Boolean Network (K = 4): (a) Original Boolean Network; (b) Duplication-free Mapping; (c) Mapping with Overlapping LUTs.
Figure 2.9 Different Techniques for Wirelength Estimation.
Figure 2.10 Vertical and Horizontal Cutlines for a Placement.
Figure 2.11 Different Sequences of Cut Lines for Min-cut Placement.
Figure 2.12 Outline of Simulated Annealing Placement Algorithm.
Figure 3.1 Power Dissipation of FPGAs.
Figure 3.2 A 3-feasible Cut for Node v.
Figure 3.3 A Mapped Network into 3-LUTs Rooted at l (Nodes a, b, c, d and e are PI nodes).
Figure 3.4 (a) Selecting a 3-feasible Cut for Node v (b) The Mapping Solution.
Figure 3.5 (a) Initial Flow Network (for T = 5.2). (b) Residual Network after the Max-flow is Computed. (c) The Updated Residual Network with a New Augmented Path Shown in Bold Edges.
Figure 3.6 Cut Frontier Refinement for Network Rooted at v (Assuming K = 4).
Figure 3.7 Pseudocode of PowerMinMap Algorithm.
Figure 3.8 Labels Computed for a Boolean Network Assuming K = 3.
Figure 3.9 Different Mappings Assuming K = 3: (a) Using Flowmap and (b) Using PMMd.
Figure 3.10 Pseudocode of PowerMinMapd Algorithm.
Figure 4.1 Top Level View of Xilinx Hierarchical FPGA.
Figure 4.2 Simplified Architecture of a CLB.
Figure 4.3 Design Flow of the Proposed Placement Algorithm.
Figure 4.4 Example of Net Clustering: (a) Netlist (b) Net Dependency Graph.
Figure 4.5 (a) Star Model of a 5-pin Net. (b) Complete Graph Model of a 5-pin Net.
Figure 4.6 Force-Directed Performance-Driven Placement Algorithm.
Figure 4.7 Experimental Flow of Our Algorithm.
Figure 5.1 RT-Level Design Model of AUDI System.
Figure 5.2 Behavioral Description of Design "mx2".
Figure 5.3 (a) DFG Representation of "mx2". (b) A Scheduling for "mx2".
Figure 5.4 Datapath Information.
Figure 5.5 Overview of Design Flow.
Figure 5.6 Delay Estimation and Cost Convergence for "latt".

LOW POWER TECHNOLOGY MAPPING AND PERFORMANCE DRIVEN PLACEMENT FOR FIELD PROGRAMMABLE GATE ARRAYS

Hao Li

ABSTRACT

As technology geometries have shrunk to the deep submicron (DSM) region, the chip density and clock frequency of FPGAs have increased significantly. This makes computer-aided design (CAD) for FPGAs very important and challenging. Due to the increasing demands of portable devices and mobile computing, low power design is crucial in CAD nowadays. In this dissertation, we present a framework to optimize power consumption for technology mapping onto FPGAs. We propose a low-power technology mapping scheme which is able to predict the impact of choosing a subnetwork covering on the ultimate mapping solution. We dynamically update the power estimation for a sequence of options and choose the one that yields the least power consumption. This technique outperforms the best low-power mapping algorithms reported in the literature. We further extend this work to generate mapping solutions with optimal delay.

We also propose placement algorithms to optimize the performance of the placed circuit. A net-cluster-based methodology is designed to ensure that closely connected nets are routed in the same region. Net clusters are obtained by clique partitioning on the net dependency graph. Net positions and the consequent cell positions are computed with a force-directed approach which drags connected nets to closer positions.

We further study the performance-driven placement problem for high level synthesis. We use the Automatic Design Instantiation (AUDI) high level synthesis system to generate a register-transfer level (RTL) netlist. This RTL netlist is fed into a CAD tool for physical synthesis. We do not necessarily go through the entire physical design process, which is usually quite time-consuming. Instead, we have created an accurate wirelength/timing estimator working on the floorplan. If the estimated timing information does not meet the constraints, guidance is generated and provided to the AUDI system. The guidance consists of the estimated timing information and instructions to produce a new netlist in order to improve the performance. Finally, the satisfying design is placed and routed. This performance-driven placement framework yields better results than a commercial CAD tool.
CHAPTER 1
INTRODUCTION

According to Moore's law [3], the total number of transistors per chip and the microprocessor's performance (measured in millions of instructions per second (MIPS)) double every 1.5 to 2 years. It has been an accurate key trend indicator for the semiconductor industry for the past 30 years. Figure 1.1 shows that the number of transistors on an Intel processor has been increasing steadily, as Moore's Law indicates. The corresponding processor type, year of market introduction, and transistor count are given in Table 1.1.

Figure 1.1. Moore's Law on Intel Processor Series. (Plot of transistors per processor, on a logarithmic axis from 1e+03 to 1e+09, against year, from 1970 to 2005.)

As process technology advanced to the deep submicron (DSM) region, the logic capacity and performance of very large scale integration (VLSI) circuits have grown rapidly. The latest commercially available microprocessor from Intel is manufactured with 90 nm technology, and its operating frequency has reached 3.4 GHz while the supply voltage has decreased to 1.4 V. Designing VLSI circuits manually or from scratch has become virtually impossible. Therefore, Computer-Aided Design (CAD) tools are essential to VLSI chip designers at every design level.

Table 1.1. Transistors on an Intel Processor Over the Years.

Processor                Year   Transistors
4004                     1971         2,250
8008                     1972         2,500
8080                     1974         5,000
8086                     1978        29,000
286                      1982       120,000
386 processor            1985       275,000
486 processor            1989     1,180,000
Pentium processor        1993     3,100,000
Pentium II processor     1997     7,500,000
Pentium III processor    1999    24,000,000
Pentium 4 processor      2000    42,000,000
Itanium processor        2002   220,000,000
Itanium 2 processor      2003   410,000,000
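As a quick arithmetic check of the doubling rate quoted for Moore's law, the first and last rows of Table 1.1 can be used to estimate an average doubling period. The sketch below is illustrative only; the variable names are chosen here and do not come from the original text.

```python
import math

# First and last rows of Table 1.1 (processor, year, transistor count).
first_year, first_count = 1971, 2_250        # 4004
last_year, last_count = 2003, 410_000_000    # Itanium 2

# Number of times the transistor count doubled between the two rows.
doublings = math.log2(last_count / first_count)

# Average number of years taken by each doubling.
years_per_doubling = (last_year - first_year) / doublings

print(f"{doublings:.1f} doublings in {last_year - first_year} years")
print(f"about {years_per_doubling:.2f} years per doubling")
```

With these two data points the estimate comes out at roughly 1.8 years per doubling, which falls inside the 1.5 to 2 year range stated above.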
The CAD tools currently in use are highly sophisticated, such that most of the design process can be taken care of automatically. This greatly expedites the design cycle and reduces the chances of design errors. As a result, the costs of VLSI chips have dropped dramatically while their performance has increased significantly over the years.

In this chapter, we provide an outline of VLSI design and computer-aided design. In particular, we focus on FPGA-based design, power minimization, and high level synthesis. This chapter is organized as follows: We give an outline of VLSI design and the corresponding new design trends in Section 1.1. We present an overview of FPGA design in Section 1.2. We show the motivation behind low power design in Section 1.3. We introduce the concept of high level synthesis and related topics in Section 1.4. We present the contributions of our work in Section 1.5. We give the outline of this dissertation in Section 1.6.

1.1 VLSI Design and Computer-Aided Design

The design of digital systems can be carried out at many different refinement levels, from the most abstract architecture down to the most detailed layout. Given the enormous complexity and the various constraints that must be met at all levels, computer-aided design is now used extensively at each level. It is unimaginable to design digital circuits with manual techniques in which each layer is hand etched or composed by laying tape on film.

1.1.1 Typical VLSI Design Cycle

A typical VLSI design cycle is shown in Figure 1.2 [4].

Figure 1.2. Typical VLSI Design Cycle. (Flow: System Specification, Architectural Design, Functional Design, Logic/Circuit Design, Physical Design, Fabrication and Testing.)
Even though the details of various VLSI systems are different, we can briefly outline the common phases of the VLSI design cycle as follows:

- System specification: A high level description of the digital system covering the size, speed, power, and functionality of the system.

- Architectural design: A micro-architectural specification (MAS) includes the number of ALUs, floating point units, the number and structure of pipelines, the choice of RISC or CISC, etc.

- Functional design: Also known as behavioral design. The behavioral aspects of the system and the interconnect demands are identified without considering implementation-related details.

- Logic and circuit design: In the logic design phase, a register-transfer level (RTL) description is derived, specifying the control flow, word widths, register allocation, arithmetic operations, and logic operations. Then, the RTL netlist is given to the logic synthesis tool to generate a detailed circuit diagram consisting of cells, macros, gates, transistors, and the interconnections between these components.

- Physical design: The circuit representation or netlist is converted into a geometric representation called a layout. The layout is obtained by converting each logic element into a geometric representation with a specific shape. Interconnections between the logic components are also expressed as lines in multiple layers. Physical design is a complex process and hence it is typically broken down into various sub-steps. Verification and validation checks are performed on the layout during physical design.

- Fabrication and testing: Fabrication is the process of depositing and diffusing various materials on the wafer. Then each chip is packaged and tested to ensure that all design specifications are met.
In general, the objectives of VLSI CAD tools are to minimize the running time of each step discussed above, thus reducing time-to-market, and to optimize the performance of the system based on user-specified constraints.

1.1.2 New Trends in VLSI Design

With the dramatic increase in circuit complexity and decrease in technology geometry, there are many new trends in VLSI design that need to be taken into account at present.

- Interconnect delay dominates: As fabrication technology advances to the deep submicron (DSM) region, the interconnect contributes more and more to the path delay. One solution to interconnect delay and signal integrity problems is to insert repeaters in long wires. Advanced planning becomes necessary because the area overhead for repeaters must be allocated upfront.

- More metal layers: The number of metal layers is increasing to meet the need for interconnection. Three layers are commonly used, while four or five layer processes are adopted mainly for microprocessors. So a three dimensional view of the interconnect is necessary.

- Synthesis: Design time can be reduced if the layout can be synthesized directly from a higher level description. Depending on the level of design at which synthesis is utilized, there are two types of synthesis.

  I. Logic synthesis: It converts an HDL description into a circuit description (schematics) and then produces its layout. Logic synthesis is a well established technology for designing blocks of a chip, and for application specific integrated circuits (ASICs). It is not applicable to larger blocks, such as RAMs, ROMs, datapaths, and microprocessors, because of slow speed and area inefficiency.

  II. High level synthesis: This process converts a behavioral description of the system into a layout or RTL description. We will provide more details on high level synthesis in Section 1.4.
1.2 Overview of FPGA-based Designs

The field programmable gate array (FPGA) was first introduced in 1985 by Xilinx. It has since become one of the most popular devices used in VLSI systems and rapid system prototyping. The key to the popularity of FPGAs is their capability to implement any digital system due to their programmability. Compared with custom design technologies such as Standard Cells, or semi-custom designs such as Mask-Programmed Gate Arrays (MPGAs), using FPGAs has two apparent benefits: lower non-recurring engineering (NRE) costs and faster time-to-market. This makes FPGAs the lowest cost implementation platform for small and medium volume designs. If any fault is found in an FPGA-based design, it can be corrected quickly by reprogramming the FPGA. To meet today's compelling requirement of short product cycles, FPGAs are preferable due to the shorter time-to-market.

However, FPGAs also have drawbacks, mainly due to the interconnect technology they use. Unlike standard cell technology, where circuitry is connected via metal wires, FPGAs connect circuitry using programmable switches. The switches have higher series resistance than metal wires and add significant parasitic capacitance to interconnections, resulting in lower circuit speed. In addition, a programmable switch uses more area than a metal wire does, which makes FPGA implementations more expensive for high volume designs. A circuit implemented using an FPGA is usually 10 times larger and 3 times slower than the same circuit implemented in an MPGA in an equivalent process [5].

An FPGA chip consists of an array of uncommitted programmable logic blocks, programmable interconnections, and I/O pads. The lookup-table (LUT) based architecture is the most popular architecture used by several FPGA manufacturers, including Xilinx [6] and AT&T [7].
The programming technology determines the method of storing the configuration information, and comes in various flavors. It has a strong impact on the area and performance of the array. The main programming technologies are: Static Random Access Memory (SRAM), antifuse, and non-volatile technologies using floating gates. The selection of the programming technology is based on the computation environment in which the FPGA device is used.

1.2.1 FPGA Interconnect Architecture

In FPGAs, the interconnect architecture is realized using programmable switches to implement different connections. The method of providing the connectivity between the logic blocks has a strong impact on the characteristics of the FPGA architecture. The arrangement of the logic and interconnect resources can be broadly classified into four categories: island style, row-based, sea-of-gates, and hierarchical.

- Island style architecture: It consists of an array of programmable logic blocks with vertical and horizontal programmable routing channels. The number of segments in the channel determines the resources available for routing. The XC4000 and XC3000 series from Xilinx [6] are examples of this type of architecture.

- Row-based architecture: It has logic blocks arranged in rows with a horizontal routing channel between successive rows. The routing tracks within the channel are divided into one or more segments. The length of the segments can vary from the width of a block pair to the full length of the channel. Other tracks run vertically through the logic blocks. They provide connections between the horizontal routing channels. The ACT3 family of FPGAs from Actel [8] is an example of this architecture.

- Sea-of-gates architecture: Unlike the previous architectures, it is not an array of logic blocks embedded in a general routing structure. This architecture consists of fine-grain logic blocks covering the entire floor of the device. Connectivity is realized using dedicated neighbor-to-neighbor routes that are usually faster than general routing resources. The SX family of FPGAs from Actel [9] is an example of this class of architecture.

- Hierarchical architecture: Most logic designs exhibit locality of connections, which implies a hierarchy in the placement and routing of the connections between the logic blocks. The hierarchical FPGA architecture exploits this feature to provide smaller routing delays and more predictable timing behavior. It is created by connecting logic blocks into clusters. These clusters are recursively connected to form a hierarchical structure. The speed of a net is determined by the number of routing switches it has to pass through. The hierarchical structure reduces the number of switches in series for long connections. The Virtex family of FPGAs from Xilinx [10] is an example of this kind of architecture.

1.2.2 FPGA Logic Block Architecture

The logic blocks on FPGAs are responsible for implementing the gate level functionality of each design. The functionality of a logic block can be defined by the number of different functions it can implement, and it has a direct impact on the routing resources. As the functionality of the logic block increases, the amount of external routing resources required decreases. On the other hand, this may result in logic wastage because the logic block cannot always be fully utilized. Therefore, there exists a tradeoff between optimizing the area and the speed for different logic block structures. The functionality of the logic blocks is derived by controlling the connectivity of some basic logic gates or by using lookup tables (LUTs).
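To make the lookup-table option concrete: a LUT with k inputs can be viewed as a truth table with 2^k stored bits, addressed by the input values. The following is a minimal Python sketch of that idea, not a model of any particular device; the helper names `make_lut` and `read_lut` are hypothetical and chosen only for illustration.

```python
from itertools import product

def make_lut(func, k):
    """'Program' a k-input LUT: store func's output for all 2**k input patterns."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

def read_lut(table, *bits):
    """Evaluate the LUT by using the input bits as a binary address into the table."""
    address = 0
    for b in bits:
        address = (address << 1) | b  # first input is the most significant bit
    return table[address]

# The same storage structure implements different functions, depending only
# on the bits loaded into it.
xor2 = make_lut(lambda a, b: a ^ b, 2)
maj3 = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)

print(xor2, read_lut(xor2, 1, 0))
print(maj3, read_lut(maj3, 1, 0, 1))
```

The design point this illustrates is that the function is defined purely by the stored contents, which is why a single LUT structure can realize any function of its inputs.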
In LUT-based FPGAs, a K-input lookup table (K-LUT) is the basic programmable logic block, which can be programmed to implement any Boolean function of up to K variables [11].

Currently, the performance of FPGAs is approaching that of ASICs. In general, the following factors affect the performance of an FPGA:

- the quality of the FPGA architecture
- the electrical design of the FPGA
- the quality of the CAD tools used to map, place, and route a design onto the FPGA.

1.3 Low Power Design

The proliferation of mobile computing platforms and portable electronic devices has made low power design a key issue in VLSI system design and CAD. The total power consumption continues to increase because of higher operating frequencies, higher overall interconnect capacitance and resistance, and the gate leakage power of the on-chip transistors. In a CMOS circuit, the amount of power dissipation is determined by the following two components [12]:

- Static power dissipation due to the leakage and standby power dissipation.
- Dynamic power dissipation due to short circuit current (transient switching current) and the charging and discharging of load capacitances (capacitive switching).

1.3.1 Static Power Dissipation

Leakage power dissipation happens because of reverse bias leakage between diffusion regions and the substrate. In addition, subthreshold conduction also contributes to leakage power dissipation.

Standby power dissipation occurs due to current drawn continuously from the power supply. For example, when both the nMOS and pMOS transistors are continuously on in a pseudo-nMOS inverter, standby power dissipation happens.

In practice, standby power is negligible compared to leakage power, and static power is measured by the leakage power as shown in Equation 1.1.

$$P_s = \sum_{i=1}^{n} (\text{leakage current})_i \times \text{supply voltage} \tag{1.1}$$

where $n$ is the total number of transistors in the CMOS circuit.

1.3.2 Dynamic Power Dissipation

Capacitive switching power is caused by charging and discharging the output capacitive load in the circuit, and it is measured as in Equation 1.2.

$$P_d = \frac{1}{2} C_L V_{dd}^2 N f_p \tag{1.2}$$

where $C_L$ is the load capacitance, $V_{dd}$ is the supply voltage, $N$ is the average number of transitions per clock cycle (switching activity), and $f_p$ is the clock frequency. This indicates that the power dissipation is proportional to the switching activity but independent of the device parameters. A circuit with a slower clock frequency dissipates less power, but this may not be desirable when designing a high speed system. Since the power depends quadratically on $V_{dd}$, reducing the supply voltage is preferred in low power design.

During the transition from 0 to 1 or 1 to 0, both the nMOS and pMOS transistors are switched on temporarily. Therefore, there exists a short circuit current from $V_{dd}$ to $V_{ss}$. The power dissipated because of this is called short circuit power, and it is computed as in Equation 1.3:

$$P_{sc} = \frac{\beta}{12} (V_{dd} - 2V_t)^3 \frac{t_{rf}}{t_p} \tag{1.3}$$

where $\beta$ is the transistor gain factor, $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, $t_{rf}$ is the rise/fall time (assuming $t_r = t_f$), and $t_p$ is the period of the input waveform.

1.3.3 Total Power Dissipation

The total power dissipation is the sum of the static power, the short circuit power, and the capacitive switching power.

$$P_{total} = P_s + P_{sc} + P_d \tag{1.4}$$

In practice, a large portion of the total power dissipation is due to the dynamic power. So research on reducing power consumption has mainly focused on reducing the dynamic power.

1.4 High Level Synthesis

High level synthesis (HLS) is a popular research topic in academia. In this section, we first introduce the basic concept of HLS. Then, we discuss the advantages of HLS.
Lastly, we present the design flow of a high level synthesis tool.

1.4.1 Basic Concept of High Level Synthesis

High level synthesis (HLS) is the process of generating a digital system by going from an abstract behavioral specification to its structural description [13] [14]. It is analogous to a "compiler" which translates a high level language such as C/C++ into a machine-dependent executable program. High level synthesis finds a register-transfer level (RTL) structure that implements the behavior while satisfying a set of constraints. The constraints in HLS include performance, area, cost, power, reliability, testability, etc. Behavior refers to the way the system or its components interact with their environment (mapping from inputs to outputs). Structure refers to the set of interconnected components that construct the system. High level synthesis is also known as algorithm-level or behavioral-level synthesis.

The inputs to HLS usually consist of the following:
- The behavioral specification of the system
- Constraints such as cost, performance, power, etc.
- The optimization function
- A module library representing the available components at RTL.

The outputs are composed of:
- RTL implementation structure (netlist)
- Controller, which is usually captured as a symbolic FSM
- Geometrical information

1.4.2 Motivation of High Level Synthesis

Generally, decisions made at the higher levels of a design have greater impact on the design than decisions made at lower levels. High level synthesis is becoming popular for the following reasons [13]:
- Shorter design cycle: If more of the design process is automated, the product can be available to the market faster at a lower cost.
- Fewer errors: If the synthesis process can be verified correctly, the possibility of errors will be lower.
- Ability to search the design space: The synthesis system produces multiple designs for the same specification in a short time. The designers have the option to choose among designs with different tradeoffs in cost, speed, power, etc.
- Self-documenting design process: An automated system can keep track of the design decisions and the effects of those decisions.
- Availability of IC technology to more people: As more design expertise is moved into the synthesis system, it becomes easier for a non-expert to produce a chip that meets a given set of specifications.

1.4.3 Phases of High Level Synthesis

A typical high level synthesis tool consists of the phases shown in Figure 1.3.

[Figure 1.3. Typical Flow of High Level Synthesis. Stages: Behavioral Description (VHDL/Verilog), Data Flow Analysis, Operation Scheduling, Module Binding and Allocation, Controller Implementation, Datapath and Control RTL Description.]

The behavior of the system to be synthesized is usually specified at the algorithmic level using a hardware description language like VHDL or Verilog. Unlike general-purpose programming languages such as C/C++, VHDL and Verilog support concurrency. The behavioral specification is transformed into a graphical representation consisting of a data flow graph (DFG) and a control flow graph (CFG). The data flow graph is a directed graph indicating the data movement, and the control flow graph is a directed graph representing the sequence of operations. An example DFG is shown in Figure 1.4 (a). During data flow analysis, several transformations are involved, including parallelism extraction, elimination of high level language constructs, loop unrolling, and common subexpression detection.

[Figure 1.4. An Example Data Flow Graph and a Schedule. (a) DFG for Out <= (A+B) * (C-D); (b) a schedule of the same DFG over control steps T1 and T2.]

Scheduling is the process of assigning arithmetic and logical operations to control steps.
A control step is defined as the fundamental sequencing unit in synchronous systems, and it typically corresponds to a clock cycle. The objective of scheduling is to minimize the amount of time or the number of control steps needed for completion of the program, given a set of constraints on the available hardware resources. Numerous scheduling algorithms have been proposed by previous researchers [15] [16] [17] [18] [19] [20]. These algorithms can be categorized as integer linear programming, as-soon-as-possible (ASAP), as-late-as-possible (ALAP), force-directed scheduling, list-based scheduling, etc. One schedule for the DFG of Figure 1.4 (a) is shown in Figure 1.4 (b).

Resource allocation refers to the process of determining the types of hardware components (functional units, storage, and communication paths) and the number of each type to be used in the final implementation. Usually, resource allocation can be divided into datapath and control allocation. Datapath allocation refers to operation selection, register/memory allocation, interconnection generation, and hardware minimization. Control allocation refers to the selection of control style (PLA, microcode, random logic, etc.) and controller generation. In practice, scheduling and allocation are closely interrelated and depend on each other.

Binding is the process of assigning operations to the allocated hardware components. In this phase, physical modules are selected that meet the specification of module parameters and constraints. Meanwhile, the controller implementation is generated. The controller is implemented as a finite state machine (Mealy or Moore machine) that controls the flow of data in the datapath. The datapath and the controller communicate through a set of registers. In the final phase of HLS, the design output is generated in a form that can be understood by the logic-level synthesis tools.
The generated output is usually described in a low level hardware language such as structural VHDL or EDIF [21].

1.5 Dissertation Contributions

The major contributions of this dissertation include:
- The low power driven FPGA technology mapping approach significantly reduces the dynamic power consumption compared with previous works.
- The extension of the proposed technology mapping algorithm is guaranteed to generate delay-optimal mapping solutions while still reducing the power consumption.
- The top-down design flow for placement is effective in enhancing the performance of the placed circuit.
- Force-directed placement schemes were mainly used for ASIC-based designs. We show that they can be utilized for FPGA placement, and the experimental results are satisfactory.
- The layout aware high level synthesis and placement algorithm is capable of efficiently predicting the timing information at the physical design level. Further, it increases the performance of the placed circuit.

1.6 Dissertation Overview

The rest of this dissertation is organized as follows. In Chapter 2, we provide the background on VLSI CAD and FPGA physical synthesis, and discuss related research work in the area of technology mapping and placement. In Chapter 3, we propose a low power driven technology mapping framework for LUT-based FPGAs. An extension of this work is also presented so as to generate delay-optimal mapping solutions while reducing the power consumption as well. In Chapter 4, we present a force-directed performance driven placement algorithm for hierarchical FPGAs. In Chapter 5, we propose a layout aware performance-driven placement flow with high level synthesis. In Chapter 6, we draw the conclusions and discuss future research directions.
CHAPTER 2
BACKGROUND AND RELATED WORK

The Field Programmable Gate Array (FPGA) was first introduced in 1985 by Xilinx as an alternative to application-specific integrated circuit (ASIC) designs. It has become one of the most popular devices for digital system prototyping and has grown into a multibillion-dollar industry. VLSI system designers can implement their designs by programming the FPGA chip in the field, thus avoiding lengthy and expensive fabrication. With the performance of FPGA chips increasing rapidly, computer aided design (CAD) for FPGAs is currently of great importance. In this chapter, we provide the background information on CAD for FPGA-based designs and then go over the related research work.

This chapter is organized as follows. In Section 2.1, we give the background information on FPGA architectures. In Section 2.2, we present the CAD design flow for FPGA physical design. In Section 2.3, we describe the previous work on FPGA technology mapping. In Section 2.4, we review the prior work on placement for FPGAs. In Section 2.5, we conclude this chapter.

2.1 FPGA Architecture

An FPGA is an off-the-shelf VLSI chip consisting of an array of configurable logic blocks (CLBs), vertical and horizontal routing channels, and programmable input/output blocks. The most popular FPGA technology used nowadays is the static random-access memory (SRAM) based FPGA. The programmability is achieved by using SRAMs to implement programmable logic blocks and to control programmable routing resources. Because SRAM cells can be rewritten by the users, SRAM-based FPGAs have the feature called field reprogrammability. The basic logic blocks are commonly realized by a $2^K$-bit SRAM cell, which represents a K-input one-output lookup table (K-LUT). A K-LUT can be programmed to implement any Boolean function of up to K variables.
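The correspondence between a K-LUT and a $2^K$-bit memory can be illustrated with a minimal Python sketch. The class name `LUT` and its methods are hypothetical names chosen for illustration, and the model ignores everything electrical: it only shows that storing one bit per input combination is enough to realize any Boolean function of K variables.

```python
from itertools import product


class LUT:
    """A K-input LUT modeled as a truth table with 2**K one-bit entries."""

    def __init__(self, k, func):
        self.k = k
        # "Programming" the LUT: store the function value for every input
        # combination, enumerated with the first input as the most significant bit.
        self.bits = [int(bool(func(*inputs))) for inputs in product((0, 1), repeat=k)]

    def evaluate(self, *inputs):
        # The inputs form the memory address; the stored bit is the output.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.bits[address]


# Two different functions realized by the same structure.
xor2 = LUT(2, lambda a, b: a ^ b)
maj3 = LUT(3, lambda a, b, c: (a + b + c) >= 2)
```

Only the stored bits differ between `xor2` and `maj3`; the lookup logic is identical, which is what makes the block reprogrammable.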
An example of a 2-input LUT is shown in Figure 2.1 (a). Flip-flops can be incorporated into a LUT-based logic cell to implement sequential functions. A fine-grained logic cell with a single flip-flop is shown in Figure 2.1 (b). A coarse-grained logic cell typically contains multiple LUTs, multiplexers, and flip-flops.

[Figure 2.1. (a) An Example of 2-input LUT. (b) A Fine-grained Logic Block with a Flip-flop.]

LUT-based FPGAs are provided by several FPGA vendors, such as Xilinx [6], Altera [22], and AT&T [23]. Besides LUTs, logic cells can also be implemented by multiplexers, usually arranged as a tree of 2-to-1 MUXes. Figure 2.2 shows the multiplexer-based logic module employed by Actel [24].

[Figure 2.2. Multiplexer-based Logic Module Used by Actel.]

Our research work is mainly concentrated on LUT-based FPGAs but can be extended to other architectures as well. While the FPGA manufacturers' products have their own unique features, all commercially available FPGAs share some common characteristics, such as an array of logic blocks and programmable routing resources. Figure 2.3 illustrates the common architecture of the array-based FPGA.

[Figure 2.3. Architecture of an Array-based FPGA. Labeled elements: I/O blocks, CLBs, connection blocks (C), switch boxes (S), horizontal routing channel, vertical routing channel.]

The configurable logic blocks, denoted as CLB in Figure 2.3, are customizable to implement the logic functions. The connection blocks, denoted as C in Figure 2.3, connect the CLB pins to the routing channels. A horizontal and a vertical routing channel are connected via a switch box, denoted as S in Figure 2.3. A switch box comprises a number of programmable switches, which account for the connections of FPGA routing.
Usually, the switches have higher resistance and capacitance and hence result in significant delays. The routing channels are segmented in order to balance circuit performance and routability. Routing tracks consist of a set of wires with different lengths, where longer wires are desirable for timing-critical nets and shorter wires are intended for short connections to save routing resources.

2.2 CAD Flow for FPGA Physical Design

Typically, a CAD system consists of design tools for performing the following tasks:
- Design entry provides the designers the option to enter the description of the digital system in the form of a schematic or Hardware Description Language (HDL) code.
- Initial synthesis creates an initial circuit according to the data entered in the design entry phase.
- Functional simulation verifies the functionality of the design.
- Logic synthesis and optimization applies optimization algorithms to obtain an optimized circuit, based on the optimization objectives specified by the designers.
- Physical design generates the layout of the optimized circuit for the given target technology (full-custom, semi-custom, FPGA, etc.).
- Timing simulation computes the expected propagation delay of the implemented circuit.
- Chip configuration configures the chip to implement the designed system.

In Figure 2.4, we show the typical design flow of a CAD system. Below we describe the details of the CAD flow.

2.2.1 Design Entry

Design entry is the starting point of the CAD system, where the designers give the description of the circuit being designed to the CAD tool. Currently, there are two major approaches to design entry: schematic capture and hardware description language (HDL).

1. Schematic Capture:
   - A schematic capture tool allows the users to draw a schematic diagram to describe the circuit using graphical symbols.
   - There are built-in libraries which contain logic gates of different types. The users select gates from the libraries to use in their schematics.
   - A previously designed circuit or subcircuit can be created as a symbol and hence can be reused. This mechanism makes it possible to design the system in a hierarchical manner.

2. Hardware Description Languages (HDLs):
   - HDLs are similar to other programming languages but are designed to describe the hardware (structurally or behaviorally). VHDL and Verilog are the most common HDLs in use today.
   - The most distinct difference between an HDL and other programming languages is its capability of representing parallel operation.
   - HDLs are more popular than schematic capture for use in design entry, and CAD tools synthesize the HDL code into a hardware implementation of the described digital system.
   - Advantages over schematic capture:
     - Portable: A circuit presented in HDL format can be synthesized and implemented by different CAD tools and chips from various companies. The HDL description need not be changed.
     - Reusable: It is easy to share and reuse HDL-specified circuits.
     - Hierarchical design capable: HDL code can be written in a modular way to facilitate hierarchical design.

[Figure 2.4. Typical Design Flow of a CAD System. Stages: Design Description, Design Entry (Schematic or HDL), Initial Synthesis, Functional Simulation, "Design correct?" check, Logic synthesis/optimization, Physical design, Timing simulation, "Design correct?" check, Chip configuration.]

2.2.2 Initial Synthesis

Once the design is entered through design entry, the HDL code or schematic diagram is translated into a logic gate network. The output of the initial synthesis tool is a lower-level description of the circuit in an appropriate form for use by succeeding tools. It consists of a set of logic expressions which describe the logic functions to be implemented.
2.2.3 Functional Simulation

Before the design can be logic optimized, the logic expressions generated by the initial synthesis are fed into a functional simulation tool to verify the functionality of the circuit. The designers specify the values of the input test vectors and check the output of the simulation to make sure that the circuit functions as anticipated. During functional simulation, the propagation delay associated with each signal is not considered.

2.2.4 Logic Synthesis and Optimization

Logic synthesis tools map the design, composed of simple gates or described in HDL code, into an optimized circuit based on the type of logic resources available in the target device. For example, if the target device is a LUT-based FPGA, the number of inputs to the logic functions in the circuit cannot exceed the input size of the LUTs. Typically, logic synthesis can be divided into two phases:

1. Technology independent logic optimization: In this phase, the circuit is logic optimized without considering the resource availability in the target device. General techniques utilized to remove redundant logic and simplify the logic include factoring, decomposition, and extraction. The most popular tools used in academia are Espresso for two-level logic optimization and SIS for multi-level circuit optimization. Both were developed by UC Berkeley. Commercial CAD vendors such as Synopsys, Cadence Design Systems, and Mentor Graphics all have their own logic optimization modules.

2. Technology mapping: Technology mapping converts the original technology independent Boolean network into a functionally equivalent network that can be implemented by the target device.
For example, for standard cell based designs, technology mapping covers pieces of the original network with the logic cells available in the standard cell library. For LUT-based FPGA technology mapping, the target device cell library is simply the LUTs. We will discuss the details of the algorithms for technology mapping for LUT-based FPGAs in Section 2.3. Generally, the optimization objectives of technology mapping include:
   - minimizing the total area of the cells needed to cover the original network.
   - minimizing the maximum circuit level of the mapped network.
   - optimizing the routability of the mapped netlist.
   - minimizing the total power consumption of the mapped circuit.

The general design flow of logic synthesis is shown in Figure 2.5.

[Figure 2.5. Logic Synthesis Design Flow. Stages: Input circuit, Logic Optimization, Optimized Network, Technology Mapping, Optimized circuit.]

2.2.5 Physical Design

Physical design is the process of transforming the netlist (a structural description) into the layout (a geometric representation). The layout is derived by converting the logic components (cells, macros, gates, transistors) into a geometric representation with specific shapes in multiple layers. The exact details of the layout also depend on design rules, which are based on the electrical properties of the fabrication materials and the constraints of the fabrication process. Physical design determines where to put the logic components, how to deal with the interconnect, etc. The objectives of physical design include small area, high performance, low power consumption, feasibility, etc. In the current DSM design regime, physical design has become a very challenging task because circuits are composed of millions of transistors and interconnects, and multiple objectives and constraints must be handled at the same time.
Consequently, computer aided design (CAD) is imperative for chip designers. The physical design flow is shown in Figure 2.6.

[Figure 2.6. Physical Design Flow. Stages: Circuit Netlist; Physical Design (Partitioning, Floorplanning and Placement, Routing, Layout Compaction, Extraction and Verification); Fabrication.]

2.2.5.1 Partitioning

VLSI systems are becoming more and more complex, which makes it almost impossible to implement a whole system on a single chip. An overly large chip area also severely hurts the yield rate, which in turn increases the cost of chips. So the system is usually partitioned into subsystems (blocks). This gives the designers the flexibility of changing part of the system without worrying about the other parts. Subsystems can be designed independently and simultaneously to expedite the entire design process. Since the subsystems are smaller, the design complexity is reduced as well. For large circuits, the partitioning process is carried out in a hierarchical manner. For example, system level partitioning is executed first, followed by board level partitioning and finally chip level partitioning. During the process of partitioning, several factors are considered, such as the size of the blocks, the number of blocks, and the number of interconnections between the blocks. Some fundamental techniques used in partitioning include the Kernighan-Lin (KL) algorithm [25], the Fiduccia-Mattheyses (FM) algorithm [26], the Sanchis algorithm [27], and Simulated Annealing [28].

2.2.5.2 Floorplanning and Placement

Once the circuit is partitioned, the area of each subcircuit can be estimated, the possible shapes of the blocks can be ascertained, and the number of pins needed by each block is also known. In addition, the area overhead for routing needs to be taken into account.
Both floorplanning and placement determine the block positions such that the area, total wire length, delay, and routability for the blocks are optimized. Typically, floorplanning is carried out prior to placement. It is a planning step to deal with hierarchical design. In this phase, the shapes of blocks are not fixed (soft blocks), and pin assignment is not finalized. For placement, the shapes of blocks are fixed (hard blocks), and pin assignment is fixed too. For full custom design, floorplanning is followed by placement. For standard cell design, in contrast, floorplanning is the same as placement because the shapes and pins of the cells are predefined. We will elaborate on the algorithms for floorplanning and placement in Chapter 4.

2.2.5.3 Routing

The routing process completes the interconnections between all blocks according to the specified netlist. Routing resources, including wires and switch boxes, are located in horizontal and vertical regions between blocks called routing channels. Generally, routing is done in two phases, referred to as Global Routing and Detailed Routing. In global routing, each net is assigned to particular routing regions without specifying the actual geometric layout of wires and pins. This is shown in Figure 2.7 (a). In detailed routing, the router determines the exact geometric layout of each net within the assigned routing regions, as shown in Figure 2.7 (b). Due to the limited routing resources available on a chip, in many cases complete routing of all the interconnections cannot be guaranteed. Therefore, rip-up and reroute is used to remove connections in congested areas and reroute them in a different order.

[Figure 2.7. Global and Detailed Routing. (a) Global Routing; (b) Detailed Routing.]

2.2.5.4 Compaction

After the detailed routing is done, the layout is ready for fabrication. Layout compaction is the task of reducing the chip area.
It is necessary because the place and route tool may not generate optimal layouts, and some vacant space may exist in the layout. By moving objects closer to each other, area and wire lengths are decreased. A smaller chip area implies that more chips can be produced on a wafer, which in turn reduces the cost of manufacturing. Note that compaction should not violate any design rules.

2.2.5.5 Extraction and Verification

Before the chip is manufactured, design rule checking (DRC) is performed to verify that all geometric patterns satisfy the design rules for the fabrication process. Once the design rules are checked and violations are removed, circuit extraction is executed to verify the functionality of the layout. It generates the circuit representation from the layout, and the extracted description is compared with the original circuit description to verify its correctness. This process is known as Layout Versus Schematics (LVS) verification. Meanwhile, it also calculates accurate timing of the components, including interconnect, to verify the performance. The extracted information is also used to check the reliability of the layout, to ensure that the layout will not fail due to electromigration, self-heating, and other effects [29].

2.2.6 Timing Simulation

Timing simulation is performed once the physical design is completed for the chosen technology device. It verifies that the circuit implemented in the target technology satisfies the anticipated timing requirement. Timing simulation simulates the actual propagation delays in the target technology. Hence, it gives a fairly precise indication of the performance before the chip is finally configured and fabricated.

2.3 Technology Mapping for LUT-based FPGAs

Logic synthesis for LUT-based FPGAs converts a network of logic gates into a functionally equivalent K-LUT network.
This process is generally divided into two phases: logic optimization and technology mapping. Logic optimization reduces the complexity of the network based on a cost function. This is typically done by removing redundancies and common subexpressions to reduce the circuit size, or by resynthesizing the critical paths to reduce the circuit delay. For technology mapping, gate decomposition and LUT mapping are carried out. Large gates are decomposed into gates with at most K inputs (i.e., K-bounded). The K-bounded network is then covered with K-LUTs in the LUT mapping phase. Note that LUTs may overlap, which means the overlapped portion of logic will be duplicated in each of the overlapping LUTs. If no logic duplication is allowed, the mapping is called a duplication-free mapping. For example, for the Boolean network shown in Figure 2.8 (a), a duplication-free mapping is shown in Figure 2.8 (b) and a general mapping allowing overlapping is given in Figure 2.8 (c). Note that the separation of optimization and mapping is artificial. Some LUT synthesis algorithms such as [30] [31] decompose collapsed networks directly into LUT networks.

[Figure 2.8. Various LUT Mappings for a Boolean Network (K = 4): (a) Original Boolean Network; (b) Duplication-free Mapping; (c) Mapping with Overlapping LUTs.]

Previous LUT-based FPGA technology mapping algorithms can be broadly classified according to their primary optimization objectives: algorithms that minimize area (i.e., the number of LUTs) [32] [33] [34] [35] [36] [37], algorithms that minimize delay (i.e., the number of LUTs on the longest path) [38] [39] [40] [41], algorithms that focus on routability [42] [43], and algorithms that minimize both area and delay [44] [45] [46].
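The gate decomposition step described above can be sketched in Python as follows. This is a minimal illustration that assumes the gate is associative (AND, OR, XOR) and simply groups inputs level by level; the function name `decompose_gate` and the generated node names are hypothetical, and the mappers surveyed in this chapter choose decompositions with delay or area in mind rather than this fixed grouping.

```python
def decompose_gate(inputs, k, prefix="n"):
    """Split one wide associative gate into a tree of gates with at most k inputs.

    Returns (gates, root), where gates is a list of (name, fanins) pairs and
    root is the name of the gate that produces the original output.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    gates = []
    signals = list(inputs)
    # Group signals k at a time until the remaining ones fit in a single gate.
    while len(signals) > k:
        next_level = []
        for i in range(0, len(signals), k):
            group = signals[i:i + k]
            if len(group) == 1:
                next_level.append(group[0])  # a lone signal passes through
            else:
                name = f"{prefix}{len(gates)}"
                gates.append((name, group))
                next_level.append(name)
        signals = next_level
    root = f"{prefix}{len(gates)}"
    gates.append((root, signals))
    return gates, root


# A 10-input AND gate made 4-bounded.
gates, root = decompose_gate([f"x{i}" for i in range(10)], 4)
```

Every gate in the result has at most k fanins, so the network is K-bounded and ready for the LUT covering phase.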
2.3.1 Technology Mapping for Delay Optimization

The delay of a LUT network can be estimated by the number of levels (or depth) in the network using the unit delay model. For depth and general static delay minimization, the mapping for each node can be optimized independently without worrying about logic sharing, because logic can be duplicated whenever needed. Therefore, the depth-optimal mapping of node $v$ depends only on the mapping of nodes in the subnetwork rooted at $v$, denoted as $N_v$. Because a mapping of $N_v$ consists of $LUT_v$ and a mapping of $N_v - LUT_v$, an optimal mapping of $N_v$ selects the "best" $LUT_v$ to minimize the delay of the optimal mapping of $N_v - LUT_v$ using dynamic programming. Usually, a label is assigned to each node in topological order to guide the dynamic programming procedure and to decide $LUT_v$ for each node $v$.

Delay-oriented FPGA mapping algorithms can be classified into two classes. The first class of algorithms, such as Chortle-d [47], DAG-Map [38], and FlowMap [39], perform LUT mapping without logic resynthesis. Chortle-d guarantees depth-optimal technology mapping for simple-gate tree networks. It partitions the input into leaf-directed acyclic graphs (leaf-DAGs). Each leaf-DAG is mapped separately as a tree for area minimization, using a dynamic programming method that enumerates all possible LUT implementations of the root node. In addition, it minimizes area as a secondary objective by using area-optimal node decomposition along noncritical paths and depth-optimal node decomposition along critical paths, as well as predecessor packing. DAG-Map employs a classical labeling algorithm called Lawler's labeling [48].
Lawler's labeling is a monotonic labeling procedure where the labels along any path from a primary input (PI) to a primary output (PO) are nondecreasing, with $l(v) = 0$ for any PI node $v$. DAG-Map has to deal with a K-bounded input network in order to find a feasible mapping solution, and it does not guarantee delay optimality. FlowMap overcomes this drawback and guarantees depth-optimal LUT mapping for general K-bounded networks. It formulates the problem as one of finding a minimum height K-feasible cut $(X, \bar{X})$ of $N_v$, where the height $h(X, \bar{X})$ is defined to be the largest label of the nodes in $X$. The key idea is to compute a minimum height K-feasible cut for every node $v$. FlowMap also uses a node splitting transformation and maximum flow computation. It is applicable not only to K-bounded networks but also to any K-mappable network. FlowMap-r [44] and CutMap [45] extend FlowMap to reduce the area while keeping depth optimality. FlowMap-d [49] and EdgeMap [50] minimize delay with a more accurate net delay model.

Under dynamic delay models, the delay of a net is associated with its structure in the mapping solution, and different branches of a multi-fanout node will interact. Consequently, the optimal mapping of $v$ depends not only on the optimal mapping of nodes in $N_v$ but also on that of nodes outside $N_v$. So dynamic programming cannot be used with this class of delay models. It was shown by Cong and Ding in [49] that the delay-optimal mapping problem is NP-hard for $K \geq 6$. This also shows that duplication-free mapping on general networks for dynamic nominal delay minimization is NP-hard for $K \geq 6$. A heuristic was presented in [49] to incrementally adjust the static delay based on the dynamic nominal delay as the mapping process proceeds.
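The labeling idea under the unit delay model can be sketched with a short Python example. Note the substitution: instead of the maximum flow computation that FlowMap uses to find a minimum height K-feasible cut, this sketch enumerates all K-feasible cuts of each node exhaustively, which yields the same depth-optimal labels but can take exponential time on large networks. The function name `depth_labels` and the dictionary-based network format are assumptions made for illustration.

```python
from itertools import product


def depth_labels(fanins, order, k):
    """Depth-optimal labels under the unit delay model, by cut enumeration.

    fanins: dict mapping each gate to its list of fanin nodes (PIs are absent).
    order:  all nodes in topological order (PIs first).
    """
    cuts = {}   # node -> list of K-feasible cuts, each a frozenset of nodes
    label = {}  # node -> minimum LUT depth of the node's output
    for v in order:
        ins = fanins.get(v, [])
        if not ins:                      # primary input
            cuts[v] = [frozenset([v])]
            label[v] = 0
            continue
        merged = set()
        # A cut of v is the union of one cut from each fanin, if it fits in K.
        for combo in product(*(cuts[u] for u in ins)):
            cut = frozenset().union(*combo)
            if len(cut) <= k:
                merged.add(cut)
        if not merged:
            raise ValueError(f"node {v} has no {k}-feasible cut")
        # The LUT rooted at v sits one level above the deepest node of its cut.
        label[v] = 1 + min(max(label[u] for u in cut) for cut in merged)
        cuts[v] = [frozenset([v])] + list(merged)
    return label


# A chain of 2-input gates computing a 5-input function.
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"], "g4": ["g3", "e"]}
order = ["a", "b", "c", "d", "e", "g1", "g2", "g3", "g4"]
labels_k4 = depth_labels(fanins, order, 4)
labels_k2 = depth_labels(fanins, order, 2)
```

With K = 4 the first three gates collapse into a single LUT, so the output `g4` has depth 2; with K = 2 every gate needs its own LUT and the depth is 4.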
The second class of LUT mapping algorithms, such as MIS-pga-delay [51], TechMap-D [52], and FlowSyn [53], collapse critical paths and then perform delay-oriented logic resynthesis. MIS-pga-delay is an extension of the UC Berkeley MIS-II logic synthesis system [54] to FPGA synthesis. It tries to collapse each critical node into its critical fanouts in a topological order. If such a collapse is K-feasible, or can be made K-feasible by decomposition with increasing the total level, it is accepted. This operation is repeated until no more collapses are possible. The best result of all applicable approaches is selected for the final mapping solution. TechMap-D is a combined area and depth minimization algorithm. It uses a cost function that represents the tradeoff among depth, area, and input size. It also has a placement phase, which is performed separately after mapping without resynthesis. FlowSyn improves FlowMap from another angle. It targets further enhancement of the depth minimization by incorporating logic optimization into the technology mapping procedure. In general, this class of algorithms can achieve mapping solutions with smaller delay than the optimal depth obtained with FlowMap, owing to resynthesis. But they suffer from longer computation times.

2.3.2 Technology Mapping for Area Minimization

Unlike delay minimization mapping, area minimization cannot be carried out independently in each subnetwork $N_v$, since the LUT sharing among overlapped subnetworks must be taken into account. It was shown by Levin and Pinter [55] that for $K = 4$ the problem of determining whether to duplicate a node with multiple fanouts or implement it as a LUT for area-optimal mapping is NP-hard. This result was further generalized by Farrahi and Sarrafzadeh [33] to $K \geq 5$.
This implies that duplication-free mapping alone will not achieve optimality. There are two main dimensions in the solution space: selecting a subset of nodes to be implemented by LUTs, or selecting a K-feasible cone to be covered by its LUT implementation. We further classify the mapping techniques for area minimization as follows:

Node selection based enumeration: Node selection based enumeration generates all node subsets for LUT implementation, and for each chosen subset determines the LUT implementation of each node. The techniques proposed in [45] [49] [56] fall into this class.

Node covering based enumeration: Node covering based enumeration first generates all LUT implementations of the nodes, and then selects a subset of them to implement. This was proposed by Murgai et al. in [57]. For each node $v$, it creates all possible LUT implementations (called supernodes) of $v$. Then it selects a subset of the supernodes under the condition that if a supernode is selected, each of its inputs must be a primary input or be generated by another selected supernode.

Integer linear programming: An integer linear programming formulation was proposed by Chowdhary and Hayes [58] for area minimization. Each node $v$ is associated with a variable $e(v) \in \{0, 1\}$, where $e(v) = 1$ indicates that $v$ is visible in the mapping solution. The objective is to minimize $\sum_v e(v)$, which is the total number of LUTs in the mapping solution, under a set of linear constraints that specify the LUT size constraints and the LUT size evaluation with consideration of reconvergent paths in the network.

Node selection based heuristics: A genetic algorithm is used for node selection. A node subset is represented by a bit string in which each bit represents a node. If the value of a bit is 1, the corresponding node is selected. This method was used in [59].
Node covering based heuristics: This class of heuristics can be regarded as an approximation to node covering based enumeration. Instead of enumerating all possible supernodes of $v$, the heuristic produces only a subset of supernodes, trying to pack as many nodes into each supernode as possible or to share as many inputs among the supernodes as possible. The packing based approach was introduced in [60] and is known as flow-pack. It was later improved in CutMap [45], which computes a minimum-cost K-feasible cut. The cost of a cut $(X, \overline{X})$ is defined as the sum of the costs of the nodes in $input(\overline{X})$. CutMap assigns lower costs to nodes that are predicted to be implemented by LUTs. Once a high-cost node is implemented by a LUT, its cost is lowered. This algorithm is also able to generate delay-optimal mapping solutions.

2.3.3 Technology Mapping for Routability and Low Power

Compared to mapping algorithms for area and delay optimization, very limited work has been reported on routability-driven technology mapping. Schlag et al. proposed an approach in [42] that uses a heuristic cost function to guide the mapper. Another algorithm, reported by Togawa et al. in [43], combines mapping with placement and routing.

Recently, with the increasing popularity of wireless devices and portable computers, reducing power consumption has become an important issue. However, relatively little work has been done on minimizing power consumption in LUT-based technology mapping. In one of the works on low-power technology mapping for LUT-based FPGAs, Farrahi and Sarrafzadeh [2] introduced a low-power driven mapping algorithm that reduces power at the expense of an increase in depth and in the number of LUTs. Wang et al.
[1] presented another algorithm that reduces the power consumption by enumerating a predefined number of cuts and choosing the one with the smallest power consumption as the final mapping. Li et al. proposed several algorithms [61] [62] [63] that reduce the power consumption by computing low-power K-feasible cuts. An extension of their work [63] also guarantees optimal delay while achieving power reduction.

2.4 FPGA Placement

In the current deep submicron (DSM) regime, designs often have over a million logic components. Placement is one of the most persistent steps in design automation, as it directly defines the on-chip interconnects, which have become the bottleneck in determining system performance. Therefore, due to the dominance of interconnect delay in DSM technology, placement has become the major factor affecting timing [64]. Essentially, in the process of placement, all logic units are placed in such a way that the design can be completely routed while satisfying a certain number of constraints or optimization objectives. The placement problem is by nature a very difficult problem. For example, placing blocks so as to minimize the total wirelength is an NP-complete problem [65]. Over the past years, a large number of heuristic algorithms have been developed for solving the placement problem under a set of optimization objectives. Generally, the optimization objectives of placement algorithms include minimizing the wiring (wirelength-driven) [66] [67] [68], balancing the wire density (routability-driven) [69] [70], and maximizing the circuit speed (timing-driven) [71] [72] [73] [74].

2.4.1 Optimization Objectives of Placement

Layout design consists of placement followed by routing. Therefore, an acceptable placement first has to be routable within the given layout area.
Since the routing information is not available until placement is done, estimation is used during placement. The speed of estimation has an important effect on the performance of the placement algorithm. Estimation must be as quick as possible, and the estimation error must be the same for all nets, i.e., it must not be skewed. Several commonly used techniques for wirelength estimation are given below.

2.4.1.1 Estimation of Wirelength

The most common assumption in estimating the total wirelength is that routing uses Manhattan geometry, which means that routing tracks can only be either horizontal or vertical. For a two-pin net connecting block $i$ and block $j$, the Manhattan distance of the net is $r_{ij} + c_{ij}$, where $r_{ij}$ and $c_{ij}$ are the numbers of rows and columns separating the two blocks. For multi-pin nets, various estimation techniques have been developed.

Half-perimeter: This is the most widely used estimation method, and it is efficient. It first finds the smallest bounding rectangle that encloses all the pins of the net to be connected. The estimated wirelength is half the perimeter of this bounding box. For circuits having only two-pin and three-pin nets, and assuming no detours in the actual routing, this method provides the best estimation. For heavily congested chips, however, this scheme tends to underestimate the actual wirelength.

Complete Graph: Each n-pin net is represented as a complete graph, which has $n(n-1)/2$ edges. Since a tree has $(n-1)$ edges, which is $2/n$ times the number of edges in the complete graph, the estimated wirelength is computed using Equation 2.1:

\[ Wirelength = \frac{2}{n} \sum_{\forall\, pair \in net} (\text{pair distance}) \tag{2.1} \]

Steiner Tree Approximation: A Steiner tree is the shortest route for connecting a set of pins. A wire can branch from any point along its length to connect to other pins of the net. The problem of finding the minimum Steiner tree is NP-complete.
An approximation algorithm such as the Lee algorithm [75] is therefore used to find an approximate Steiner tree by propagating a wave for the entire net.

Minimum Spanning Tree: In a minimum spanning tree, branching is only allowed at the pin locations. There are algorithms that find the minimum spanning tree in polynomial time, such as Kruskal's algorithm and Prim's algorithm [76].

In Figure 2.9, we show some examples to illustrate the wirelength estimation techniques discussed above.

Figure 2.9. Different Techniques for Wirelength Estimation: (a) half-perimeter, length = 11; (b) complete graph, length × 2/n = 17.5; (c) Steiner tree, length = 12; (d) minimum spanning tree, length = 13.

2.4.1.2 Minimize Total Wirelength

To obtain a routable placement, the area used by the routing wires should be minimized. One common approach to achieve this is to place strongly connected blocks close to each other. The corresponding objective is to minimize the total weighted wirelength over all nets, $L(P)$, computed as shown in Equation 2.2:

\[ L(P) = \sum_{n \in N} w_n \cdot d_n \tag{2.2} \]

where $d_n$ is the estimated wirelength of net $n$, $w_n$ is the weight of net $n$, and $N$ is the set of all nets. With this method, the length of each net is computed independently, so the estimate is not very accurate.

2.4.1.3 Minimize Maximum Density

The routability of a placement can be estimated by the density, denoted as $D(P)$. For each routing channel, there is a maximum number of wires that can pass through the channel. We call this number, $c(e_i)$, the routing capacity of the channel. Given a placement $P$, let $n(e_i)$ denote the estimated number of nets passing through edge $e_i$ of a channel. Then the density of edge $e_i$ is defined as:

\[ d(e_i) = \frac{n(e_i)}{c(e_i)} \tag{2.3} \]

For a placement to be routable, $d(e_i)$ cannot exceed 1. The routability of the placement is then given in Equation 2.4.
\[ D(P) = \max_i \, [d(e_i)] \tag{2.4} \]

where the maximum is taken over all edges $e_i$ in the layout area.

2.4.2 Placement Approaches

As mentioned previously, the placement problem is NP-complete. This does not change when cost functions and constraints are taken into consideration. For current large digital system designs, the solution space is so large that enumerative techniques are practically prohibitive. Therefore, developing heuristic methods that have short execution times and generate good solutions is the most appropriate choice for researchers and CAD tool engineers. A heuristic algorithm is either constructive or iterative. As the name implies, a constructive placement approach constructs a solution by placing one block at a time. At the end of each step we have a partial placement, and there are two decisions to be made: (i) which unplaced block should be selected and added to the partial placement, and (ii) where should this block be placed? Various heuristics are adopted to make these decisions. A constructive placement technique is typically greedy: at each step, it makes the best possible selection. This method does not guarantee an optimum solution because each decision is made in the absence of complete information.

The placement algorithms in use today mainly follow one of three approaches: min-cut (partitioning-based), analytic (numerical), and simulated annealing based. In the next subsections, we provide more details on these approaches.

2.4.2.1 Partitioning-based Placement Algorithms

Partitioning-based placement algorithms repeatedly partition the circuit into two subcircuits. Meanwhile, at each level of partitioning, the available layout area is divided into horizontal and vertical subsections alternately. Each of the subcircuits is assigned to a subsection.
This process is carried out until every subcircuit consists of a single block and has a unique position in the layout area. During partitioning, the number of nets that are cut by the partition is usually minimized.

Consider a rectangular layout space as shown in Figure 2.10. The vertical line, denoted as $Cut_v$, which divides the layout area into a left region and a right region, is called a vertical cut. The horizontal line, denoted as $Cut_h$, which divides the layout area into a top region and a bottom region, is called a horizontal cut. The cut size of a cut is defined as the number of nets that cross the cut line. For example, in Figure 2.10, $Cut_h$ has a cut size of 3 and $Cut_v$ has a cut size of 4. For a given placement $P$, let $H(P)$ denote the maximum cut size among all horizontal cuts, and let $V(P)$ denote the maximum cut size among all vertical cuts. Minimizing $H(P)$ and $V(P)$ improves the routability of a gate-array placement, and hence also the wirelength and timing.

Figure 2.10. Vertical and Horizontal Cutlines for a Placement.

Partitioning-based placement was first proposed by Breuer [77] [78]. Subsequent work includes [69] [79] [80] [81]. Several objective functions employed by these algorithms are given below:

1. Total net-cut: All nets cut by the partitioning (horizontal and vertical cut lines) are considered. It has been shown that minimizing this sum is equivalent to minimizing the half-perimeter wirelength [77] [78].

2. Min-max cut value: The objective is to minimize the number of nets cut by the cut line across the channel. This reduces channel congestion, which leads to a smaller channel width and chip area. Some nets have to be routed with detours through less congested channels.

3. After each partition, the number of net cuts is minimized. This is basically a greedy approach, so it may not minimize the total number of nets cut.
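The cut size defined above is simple to compute once block positions are known. The following is a minimal sketch rather than the procedure of [77] [78]: the function names, the representation of a net as a list of block names, and the example placement are all assumptions made for illustration. A net is counted as cut when it has blocks on both sides of the cut line.

```python
def cut_size(nets, positions, axis, cut_coord):
    """Count the nets that cross one cut line.

    axis=0 means a vertical line x = cut_coord (separates left from right);
    axis=1 means a horizontal line y = cut_coord (separates top from bottom).
    Blocks are assumed never to lie exactly on the cut line.
    """
    crossing = 0
    for blocks in nets:
        coords = [positions[b][axis] for b in blocks]
        if min(coords) < cut_coord < max(coords):
            crossing += 1
    return crossing


def max_cut_size(nets, positions, axis, cut_coords):
    """Largest cut size over a family of parallel cut lines, i.e. V(P) or H(P)."""
    return max(cut_size(nets, positions, axis, c) for c in cut_coords)


# Hypothetical 2x2 placement with four blocks and four nets.
positions = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (1, 1)}
nets = [["a", "b"], ["a", "c"], ["b", "c", "d"], ["c", "d"]]

print(cut_size(nets, positions, axis=0, cut_coord=0.5))  # vertical cut: 3
print(cut_size(nets, positions, axis=1, cut_coord=0.5))  # horizontal cut: 2
```

A min-cut placer would evaluate such a count repeatedly while moving blocks across the cut line; the sketch only shows the quantity being minimized.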
In addition to the above objective functions, several placement procedures that use different sequences of cut lines have been developed.

Cut oriented min-cut placement: The circuit is first cut by a partition into two subcircuits such that the net cut is minimized. All the subcircuits are further partitioned by the second cut line, and this procedure is carried out for all cut lines. This method is sequential and easy to implement. It is illustrated in Figure 2.11 (a).

Quadrature placement: Each region is partitioned into four regions of equal size by applying horizontal and vertical cut lines alternately, as shown in Figure 2.11 (b). During each partitioning, the cut size of the partition is minimized. This procedure reduces the routing density in the center of the layout area. It is the most popular sequence of cut lines for min-cut based algorithms.

Bisection placement: The layout area is repeatedly bisected (partitioned into two equal parts) by horizontal cut lines until each subregion consists of only one row. Then these rows are bisected with vertical cut lines until each subregion contains only one slot. This procedure is mainly used in standard cell placement and is shown in Figure 2.11 (c).

Slice bisection placement: A number of blocks are partitioned from the rest of the circuit using a horizontal cut line and assigned to a row; this is called slicing. It is repeated until every block has been assigned to a row. Then the blocks are assigned to columns by bisecting with vertical cut lines. This method is most suitable for circuits that have a high degree of interconnection at the periphery. Figure 2.11 (d) shows the sequence of cut lines for this procedure.

However, layouts derived solely by partitioning blocks and assigning them to regions are not good enough.
The main reason is that this procedure does not take into account the locations of external pins, the probable locations of blocks in the final placement, or the signals that enter a group of blocks. A method called terminal propagation was developed by Dunlop and Kernighan [79]. In this work, dummy terminals are generated so that external pin positions are taken into account while computing the min-cut.

Figure 2.11. Different Sequences of Cut Lines for Mincut Placement: (a) cut oriented mincut; (b) quadrature mincut; (c) bisection mincut; (d) slice bisection.

2.4.2.2 Analytic-based Placement Algorithms

The placement problem can often be transformed into an analytic optimization problem. It is then reduced to the problem of solving a set of simultaneous linear equations to determine the equilibrium positions (ideal x-y coordinates) of the logic blocks. This class of algorithms is usually based on a quadratic wirelength objective. The first quadratic placement algorithm was proposed by Hall in [82]. The cost of connecting a pair of blocks $B_i$ and $B_j$, denoted by $c_{ij}$, is given in a connectivity matrix $C = [c_{ij}]$. Minimizing the total wirelength of the circuit is then posed as minimizing the quadratic wirelength shown in Equation 2.5:

\[ Wirelength = \frac{1}{2} \sum_{i,j=1}^{n} c_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] \tag{2.5} \]

where $(x_i, y_i)$ are the coordinates of logic block $i$ in the layout area. The quadratic wirelength objective function can be rewritten using matrix notation as in Equation 2.6.
\[ Wirelength = x^T B x + y^T B y \tag{2.6} \]

where the matrix $B = [b_{ij}]$ has entries $b_{ij}$ as given in Equation 2.7:

\[ B = [b_{ij}] = \begin{cases} -c_{ij} & i \neq j, \\ \sum_{j=1}^{n} c_{ij} & \text{otherwise} \end{cases} \tag{2.7} \]

A trivial solution for the above objective function is $x_i = y_i = 0$ for all $i$. Clearly this is not what we want, since every block would be placed at the same location. It was proved in [82] that a nontrivial solution is obtained and the total wirelength is minimized if the smallest nonzero eigenvalues of the matrix $B$ are chosen. The corresponding eigenvectors $X$ and $Y$ then give the coordinates of all the blocks. This technique is further improved in [66] [83] [84] [85].

2.4.2.3 Simulated Annealing Placement Algorithms

Simulated annealing is applied in placement as an iterative improvement algorithm, and it is one of the most sophisticated placement techniques currently in use [28] [68] [71] [86] [87]. The place-and-route package TimberWolf, developed by Sechen [86], was the earliest package to apply simulated annealing to the placement problem. In simulated annealing, a series of random moves (moving one block to another location, swapping the locations of two blocks, etc.) is executed on an existing solution. Moves that result in a decrease in cost are accepted. Cost is often defined as the total wirelength or the area required for the placement. Moves that result in an increase in cost are accepted with a probability that decreases over the iterations. This is useful because it allows the search to escape local minima, which are very common in real designs. A parameter known as the temperature $T$ is introduced to control this probability: the probability of accepting a bad move decreases as the temperature $T$ decreases. The acceptance probability is given by $e^{-\Delta C / T}$, where $\Delta C$ is the increase in cost.
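The acceptance rule just described can be sketched in a few lines. This is an illustrative sketch only: the function names and the optional `rng` argument (used here to make the behavior reproducible) are assumptions, and the temperature is assumed to be strictly positive.

```python
import math
import random


def acceptance_probability(delta_cost, temperature):
    """Probability of accepting a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return 1.0  # improving (or neutral) moves are always accepted
    return math.exp(-delta_cost / temperature)


def accept_move(delta_cost, temperature, rng=random):
    """Draw a uniform number in [0, 1) and accept if it falls below the probability."""
    return rng.random() < acceptance_probability(delta_cost, temperature)


# The same cost increase becomes less likely to be accepted as T drops.
for t in (10.0, 1.0, 0.1):
    print(t, acceptance_probability(1.0, t))
```

At a high temperature almost every move is accepted, so the search behaves like a random walk; as the temperature falls, the rule approaches pure greedy improvement.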
Simulated annealing based algorithms start with a very high temperature, which gradually decreases so that moves that increase the cost have a lower probability of being accepted. The algorithm terminates once the temperature is below a predefined threshold, at which point it has converged to an optimal or suboptimal solution. The outline of the simulated annealing based placement algorithm is illustrated in Figure 2.12.

Algorithm: Simulated Annealing Based Placement
begin
    P = RandomPlacement();
    T = InitialTemperature();
    while (ExitCriterion() == False) do
        while (InnerLoopCriterion() == False) do
            P' = NewPlacement(P);
            Δ = cost(P') - cost(P);
            if Δ ≤ 0 then P ← P'
            else if (Random(0,1) < e^(-Δ/T)) then P ← P'
        end while;
        T = CoolingSchedule(T);
    end while;
    return P;
End algorithm;

Figure 2.12. Outline of Simulated Annealing Placement Algorithm.

The parameters and functions used in the simulated annealing algorithm determine the quality of the placement generated. For example, the cooling schedule controls how fast the temperature decreases. If the temperature decreases slowly, the quality of the placement is generally better, but the running time is longer. On the other hand, if the cooling schedule is designed so that the algorithm converges quickly, execution time is saved at the expense of a possibly worse placement. Therefore, various cooling schedules have been developed to achieve a better tradeoff. The advantage of simulated annealing based placement algorithms is the ease of adding new optimization objectives or constraints. Their drawback, which is often very undesirable, is the excessive running time needed.

2.4.2.4 Summary of Different Placement Algorithms

For large problems, the analytical placement approach is preferred because it is fast.
It usually uses a quadratic wirelength objective, which can be minimized very efficiently. It is typically employed in a flat scheme in order to sustain the global view of the design. One drawback of analytical placement algorithms is that they tend to create a large amount of overlap, and the quadratic objective function is only a coarse indicator of the wirelength. For simulated annealing and partitioning based algorithms, a hierarchical methodology is generally used to reduce the problem size and hence speed up the resulting algorithms. Recently, Hu et al. proposed a method to convert a flat design to a hierarchical approach by incorporating a fine-granularity clustering technique [88].

In this dissertation, we focus on the FPGA placement problem for enhancing timing as well as net length. Below, we discuss the details of several placement algorithms proposed by previous researchers for optimizing timing.

Previous timing-driven placement algorithms try to minimize the longest path delay or to maximize the minimum slack value of the circuit. Existing algorithms can be divided into two categories: path-based algorithms [71] [72] [83] and net-based algorithms [73] [85] [87]. Path-based algorithms generally minimize the longest path directly and keep accurate timing information during optimization. Most of the approaches in this class are based on mathematical programming techniques, but they suffer from relatively high computational costs and cannot be integrated into certain placement suites. Net-based algorithms usually transform timing information into either delay or net weight constraints and employ weighted wirelength minimization. Net weighting is a commonly used approach in this class. Essentially, more timing-critical nets are assigned higher weights.
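To make the net weighting idea concrete, the sketch below combines it with the weighted wirelength objective of Equation 2.2 and the half-perimeter estimate of Section 2.4.1.1. It is not taken from any of the cited placers: the linear slack-to-weight mapping, the parameter `alpha`, and all function and variable names are hypothetical choices made for illustration, and `max_slack` is assumed to be positive.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding box of a list of (x, y) pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def net_weight(slack, max_slack, alpha=1.0):
    """Hypothetical linear weighting: weight 1 for the most relaxed net,
    weight 1 + alpha for a net with zero (or negative) slack."""
    criticality = 1.0 - min(max(slack / max_slack, 0.0), 1.0)
    return 1.0 + alpha * criticality


def weighted_wirelength(nets, positions, slacks, max_slack, alpha=1.0):
    """L(P) = sum over nets of w_n * d_n, with d_n the half-perimeter estimate."""
    total = 0.0
    for name, blocks in nets.items():
        d_n = half_perimeter([positions[b] for b in blocks])
        w_n = net_weight(slacks[name], max_slack, alpha)
        total += w_n * d_n
    return total


# Hypothetical data: net n1 is timing critical (zero slack), net n2 is not.
positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = {"n1": ["a", "b"], "n2": ["a", "c"]}
slacks = {"n1": 0.0, "n2": 4.0}

print(weighted_wirelength(nets, positions, slacks, max_slack=4.0))  # 2*7 + 1*2 = 16.0
```

Because the critical net counts twice in the objective, a wirelength-driven placer minimizing this cost is pushed to shorten it first, which is the intended effect of net weighting.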
Net-based placement algorithms are currently favored because of their relatively low complexity, strong flexibility, and easy implementation. With more and more complex timing constraints present in modern digital systems, such as multiple clock domains and multiple-cycle paths, these advantages make the net-based approach even more attractive.

2.5 Conclusion

In this chapter, we discussed the details of modern VLSI design and computer-aided physical design automation, with special emphasis on FPGA-based technology mapping and placement. As on-chip logic complexity and capacity have been increasing constantly, many problems in the VLSI CAD domain are becoming more and more challenging. Multiple constraints (optimization objectives) in a single design phase are very common nowadays. For instance, optimizing timing, minimizing area, and maximizing routability usually have to be achieved simultaneously in placement. Sometimes these constraints even conflict with each other (reducing area can hurt routability), which makes the problem more difficult. On the other hand, there still remains a large scope to be explored in VLSI CAD research. According to recent work presented by Cong et al. [89], the performance of the best existing algorithms in placement and partitioning still falls significantly short of the optimum. This fact gives many researchers in VLSI design and CAD great inspiration and confidence to study and develop more advanced algorithms to attack these seemingly tough problems. Through all these efforts, we expect many of the difficult problems that exist now to be solved or alleviated soon.
CHAPTER 3
LOW POWER TECHNOLOGY MAPPING FOR LUT-BASED FPGAS

In recent years, mobile computing and portable devices such as cellphones, laptops, and PDAs have been used more and more extensively. These devices are power sensitive, which makes power saving technology an important issue for both VLSI chip designers and CAD tool developers and researchers. In this chapter, we study low power technology mapping for LUT-based FPGAs. As we have shown in Chapter 2, technology mapping for LUT-based FPGAs has been studied extensively for minimizing area and delay and for improving routability. However, relatively little work has been done on technology mapping for power saving. The problem has previously been proved to be NP-hard. Therefore, we present an efficient heuristic algorithm to generate low-power mapping solutions. The major contribution of our algorithm is that, while generating a LUT, we look ahead at the impact of the mapping selection for this LUT on the power consumption of the remaining network, and we choose the mapping that results in the least power consumption. The key idea is to generate and select low-power K-feasible cuts by an efficient incremental network flow computation method. Experimental results show that our algorithm reduces power consumption as well as area compared with the best algorithms reported in the literature.

In addition, we extend this work to compute delay-optimal power saving mappings. We compute low-power K-feasible cuts to generate LUTs covering nodes on noncritical paths, and we compute min-height K-feasible cuts to generate LUTs covering critical nodes.
Compared with CutMap, a delay-optimal mapper with simultaneous area minimization, we achieve 14% power savings on average without any delay penalty.

This chapter is organized as follows. In Section 3.1, we present the outline of our algorithm and briefly review two previous low power mapping algorithms. In Section 3.2, we give the problem formulation. In Section 3.3, we give the power estimation model used in our work. In Section 3.4, we propose our power saving technology mapping algorithm. In addition, we extend this algorithm so that it guarantees the solution to be delay optimal while reducing power consumption. In Section 3.6, experimental results are given. In Section 3.7, we draw conclusions for low power technology mapping.

3.1 Introduction

Power consumption has become a limiting factor for high performance designs and mobile computing. Figure 3.1 tracks the overall power dissipation of an FPGA chip with process technology [90]. The desired goal is to maximize the operating frequency while satisfying power constraints.

Figure 3.1. Power Dissipation of FPGAs. (Axes: year, 1998 to 2014, versus power dissipation in W, 0 to 200.)

For FPGA-based designs, minimizing power consumption is especially crucial because FPGA chips are less power efficient than logically equivalent ASICs. This is due to the fact that a large portion of the FPGA area is used by transistors that realize programmability. This drawback hinders FPGA devices from being used in low power applications. Therefore, it is of great importance to consider reducing power consumption in the technology mapping phase.

The FPGA technology mapping problem for power minimization has been proved to be NP-complete [33].
Therefore, several heuristics have been proposed for power reduction in LUT-based FPGA technology mapping [1] [2] [91]. The main idea is to hide nodes with high transition activity inside LUTs so that the overall power consumption is minimized. In [2], Farrahi and Sarrafzadeh showed that the power consumed by a LUT depends on the transition density and the fanout number of the LUT. They also proposed an approach to estimate the total power consumption of a technology-mapped circuit. One deficiency of their algorithm is that it fails to take into account the impact of the fanout number of LUTs on the total power consumption; it also occupies more area. In [1], Wang et al. proposed another algorithm that uses a "cut enumeration" method to generate a predefined number of mapping solutions for the subcircuit rooted at each node. However, it does not guarantee that the final mapped circuit will consume less power when a larger number of mapping solutions is tested. All of these algorithms reduce power at the expense of a longer critical path delay.

As part of this dissertation, we developed an algorithm known as PowerMinMap, which not only reduces power dissipation compared to previous approaches but also guarantees depth optimality. The main idea is to compute low-power K-feasible cuts to generate LUTs that cover the given circuit. Experimental results show that our algorithm significantly reduces both power and area compared with [1] and [2]. An extension of this work, which computes delay-optimal mappings, is also implemented. This is done by computing min-height K-feasible cuts for nodes on critical paths and low-power K-feasible cuts for noncritical nodes.
Our algorithm results in 14% power savings without any delay penalty compared to the delay-optimal CutMap algorithm.

3.2 Problem Formulation

A general Boolean network $N$ can be represented as a directed acyclic graph (DAG) $N(V, E)$, where $V$ is the set of nodes and $E$ is the set of directed edges. Each node in the DAG represents a logic gate, and a directed edge $(u, v)$ exists only when the output of node $u$ is an input of node $v$. A primary input (PI) node has no incoming edge and a primary output (PO) node has no outgoing edge. For a node $v$, $inputs(v)$ denotes the set of fanin nodes of $v$. A Boolean network $N$ is K-bounded if $|inputs(v)| \leq K$ for every node $v$ in $N$. Given a subgraph $H$ of $N(V, E)$, $inputs(H)$ denotes the set of distinct nodes that provide inputs to the nodes in $H$.

A cone at node $v$ is a subgraph $C_v$ consisting of $v$ and its predecessors such that any path connecting a node in $C_v$ and $v$ lies entirely within $C_v$. A cone $C_v$ is K-feasible if the number of nodes feeding into $C_v$ is no larger than $K$. The technology mapping problem can also be considered as partitioning the DAG representation of the circuit into K-feasible cones (equivalently, K-LUTs). Note that we allow these cones to overlap, which means that node duplication is allowed. The level of a node $v$ in a network is the length of the longest path between $v$ and some PI node. The levels of all PI nodes are zero. The depth of a given network is the largest node level in the network.

Given a K-bounded Boolean network, let $N_v$ denote the subnetwork consisting of node $v$ and all its predecessors. A cut $(X_v, \overline{X}_v)$ for node $v$ is a partition of the nodes in $N_v$ such that $v$ is in $\overline{X}_v$. The node cut-size is defined as the number of nodes in $X_v$ that are adjacent to some node in $\overline{X}_v$. If the cut size is no larger than $K$, the cut is called a K-feasible cut.
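These definitions translate directly into code. The sketch below is illustrative only: it represents the network as a dictionary mapping each non-PI node to its fanin list, and the function names and the small example network are assumptions (the network is hypothetical and is not the one drawn in Figure 3.2).

```python
def fanin_cone(fanins, v):
    """Return the node set of N_v: v together with all of its predecessors."""
    seen = set()
    stack = [v]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(fanins.get(node, ()))  # PIs have no fanin entry
    return seen


def node_cut_set(fanins, v, x_bar):
    """Nodes of X_v that feed some node of X_bar_v, for the cut (N_v - x_bar, x_bar)."""
    n_v = fanin_cone(fanins, v)
    if v not in x_bar or not x_bar <= n_v:
        raise ValueError("x_bar must contain v and lie inside N_v")
    return {u for w in x_bar for u in fanins.get(w, ()) if u not in x_bar}


def is_k_feasible(fanins, v, x_bar, k):
    """A cut is K-feasible when its node cut-size does not exceed K."""
    return len(node_cut_set(fanins, v, x_bar)) <= k


# Hypothetical 2-bounded network with primary inputs a, b, c, d.
fanins = {
    "g": ["a", "b"],
    "h": ["b", "c"],
    "i": ["c", "d"],
    "j": ["g", "h"],
    "v": ["j", "i"],
}

print(sorted(node_cut_set(fanins, "v", {"v", "j"})))  # ['g', 'h', 'i']
print(is_k_feasible(fanins, "v", {"v", "j"}, 3))      # True
```

Growing the second part of the cut changes the cut set: with `{"v", "j", "g", "h", "i"}` the cut set becomes the four primary inputs, which is no longer 3-feasible.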
If $(X_v, \overline{X}_v)$ is a K-feasible cut, we can cover all nodes in $\overline{X}_v$ by a single K-input LUT. For example, Figure 3.2 shows a 3-feasible cut $(X_v, \overline{X}_v)$ for node $v$; the node cut set is $\{g, h, i\}$.

Figure 3.2. A 3-feasible Cut for Node v.

The delay of a circuit mapped into K-LUTs is determined by two factors: the delay due to the LUTs and the delay due to the interconnections. The inputs to a LUT may have different signal arrival times, but once the input combination settles, the LUT takes constant time (i.e., the access time of the SRAM) to produce the output, independent of the function it implements. Since layout information is not available at the mapping stage, we assume that each edge in the mapping solution has the same delay. Thus the total delay can be measured by the depth of the mapping solution. This unit delay model has been widely used in previous research work [38] [39] [44] [45].

In this dissertation, we study LUT-based technology mapping for power minimization. Given a K-bounded Boolean network, we would like to find a mapping solution consisting only of K-LUTs such that the total power consumption of the mapped circuit is minimized. We also study this problem under the additional constraint that the mapped circuit should have optimal depth.

3.3 Power Estimation Model

In this section, we review the power estimation model used in [1] [2], which estimates the power consumption of a mapped circuit consisting of LUTs only. The majority of the power consumption in CMOS circuits is due to dynamic power dissipation [12]. Dynamic power dissipation $P_d$ occurs because of the switching activity (either a $0 \to 1$ or a $1 \to 0$ transition) of the circuit, which results in the charging and discharging of the load capacitance.
The equilibrium probability of a signal $v$, denoted as $p(v)$, is defined as the probability that signal $v$ has the value 1. The transition density of a signal $v$, denoted as $d(v)$, is defined as the number of times that signal $v$ changes its value in unit time. The formal definitions are shown below:

Definition 1. The equilibrium probability of $x(t)$, denoted as $p(x)$, is defined as:

$$p(x) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{+T/2} x(t)\, dt \tag{3.1}$$

Definition 2. The transition density (TD) of $x(t)$, denoted as $d(x)$, is defined as:

$$d(x) = \lim_{T \to \infty} \frac{n_x(T)}{T} \tag{3.2}$$

where $n_x(T)$ is the number of transitions of $x(t)$ in the time interval $(-T/2, +T/2]$.

Let $y = f(x_1, x_2, \ldots, x_n)$ be a Boolean function. The Boolean difference of $y$ with respect to $x_i$, denoted as $\partial y / \partial x_i$, is defined as:

$$\frac{\partial y}{\partial x_i} = y|_{x_i = 0} \oplus y|_{x_i = 1} \tag{3.3}$$

Let $M$ be a logic module with inputs $x_1, \ldots, x_m$ and outputs $y_1, \ldots, y_n$. If there is no propagation delay associated with $M$, $M$ is known as a zero-delay logic module. We quote the theorem presented in [92] to show the relationship between $d(y_i)$ and $d(x_i)$. The transition density at the output of a gate can be calculated as follows:

$$d(y) = \sum_{i=1}^{n} p\!\left(\frac{\partial y}{\partial x_i}\right) d(x_i) \tag{3.4}$$

Using this theorem, we can compute the transition density of any node in the network when the transition densities at the primary inputs are given. The proof of this theorem can be found in [92]. It is based on the assumption that the fanins of every node are independent. However, this may not always be true. Even though it holds for PIs, the existence of reconvergent paths may cause correlation among the values of the fanins to a node. It is shown in [92] that if the modules are large enough that tightly coupled nodes are kept inside the same module, then the independence assumption holds.
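As a hedged illustration of Equation (3.4), the following Python sketch computes the transition density at the output of a single zero-delay gate by enumerating its truth table. The function names (`boolean_difference_prob`, `transition_density`) and the two example gates are assumptions made for illustration, not part of the original implementation; the inputs are assumed independent, as in the text.

```python
from itertools import product


def boolean_difference_prob(f, n, i, p):
    """Probability that the Boolean difference of f w.r.t. input i equals 1.

    f: function of n 0/1 arguments; p[j] = P(x_j = 1), inputs independent.
    """
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product([0, 1], repeat=n - 1):
        x = [0] * n
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= p[j] if b else 1.0 - p[j]
        x[i] = 0
        y0 = f(*x)
        x[i] = 1
        y1 = f(*x)
        if y0 != y1:  # y|x_i=0 XOR y|x_i=1 is 1 for this assignment
            total += weight
    return total


def transition_density(f, p, d):
    """Equation (3.4): d(y) = sum_i P(dy/dx_i) * d(x_i)."""
    n = len(p)
    return sum(boolean_difference_prob(f, n, i, p) * d[i] for i in range(n))


def and2(a, b):
    return a & b


def xor2(a, b):
    return a ^ b


# PI statistics in the style of Section 3.6: p = 0.5 and d = 10,000.
d_and = transition_density(and2, [0.5, 0.5], [10000.0, 10000.0])
d_xor = transition_density(xor2, [0.5, 0.5], [10000.0, 10000.0])
```

For the AND gate each input toggles the output only when the other input is 1, so half of the input transitions propagate; for the XOR gate every input transition propagates.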
The gate propagation delay and the temporal correlation on the nodes are ignored, as in [1] and [2]. Once the transition density for every gate has been computed, the power consumption of a given circuit can be computed as:

$$P_d = \frac{1}{2} \sum_{i=1}^{n} C_i V_{dd}^2\, d(i) \tag{3.5}$$

where $C_i$ is the load capacitance of gate $i$, $V_{dd}$ is the supply voltage, and $d(i)$ is the transition density of the output of gate $i$.

For LUT-based FPGAs, the above formula still holds except that the transitions happen only at the inputs/outputs of each LUT. Given a mapping solution, for a LUT rooted at node $v$ and with input nodes $u_1, u_2, \ldots, u_k$, the power dissipation due to the LUT, $P(v, \{u_1, u_2, \ldots, u_k\})$, is given by:

$$P(v, \{u_1, u_2, \ldots, u_k\}) = K_p C_{out}\, d(v) + \sum_{i=1}^{k} K_p C_{in}\, d(u_i)$$

where $K_p = 0.5\, V_{dd}^2$ is a constant, $C_{out}$ is the output capacitance of the LUT, and $C_{in}$ is the input capacitance of the LUT.

Now, given a Boolean network, we estimate the minimum power dissipation when it is mapped into K-input LUTs as follows. Define a LUT-cover rooted at node $v$ as a LUT that covers all the nodes in a K-feasible cone rooted at $v$. Let $S_v$ denote the set of all possible LUT-covers rooted at node $v$. Let $inputs(L)$ denote the set of nodes that provide inputs to a LUT-cover $L$ in $S_v$. We use the following recursive formula to estimate the minimum power consumption of a LUT-subnetwork covering $N_v$:

$$EP(v) = \min_{L \in S_v} \Big\{ P(v, inputs(L)) + \sum_{u \in inputs(L)} EP(u) \Big\} \tag{3.6}$$

Note that $EP(v)$ is actually an upper bound on the minimum power consumption of any LUT-subnetwork covering $N_v$ when there are reconvergent fanouts within $N_v$. For any PI node $u$, we assume that $EP(u)$ is equal to zero.
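The recursion in Equation (3.6) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the candidate LUT-covers of each node are supplied explicitly as tuples of input nodes, the constants $K_p C_{out}$ and $K_p C_{in}$ default to 1, and the small network and all numeric values below are hypothetical rather than taken from the thesis.

```python
def lut_power(d_out, d_inputs, kp_c_out=1.0, kp_c_in=1.0):
    """Power of one LUT: K_p*C_out*d(v) + sum_i K_p*C_in*d(u_i)."""
    return kp_c_out * d_out + kp_c_in * sum(d_inputs)


def estimated_power(v, covers, d, ep):
    """Equation (3.6): minimum over the candidate LUT-covers rooted at v.

    covers: iterable of tuples, each holding the input nodes of one cover.
    d, ep:  dicts of transition densities and already computed EP values.
    """
    return min(
        lut_power(d[v], [d[u] for u in cover]) + sum(ep[u] for u in cover)
        for cover in covers
    )


# Hypothetical network: j = f(a, b, c), k = f(d, e), l = f(j, k).
d = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0,
     "j": 2.0, "k": 1.5, "l": 1.0}
ep = {u: 0.0 for u in "abcde"}  # EP of every PI node is zero

# Nodes are processed in topological order so that EP of the inputs exists.
ep["j"] = estimated_power("j", [("a", "b", "c")], d, ep)
ep["k"] = estimated_power("k", [("d", "e")], d, ep)
# Two 3-feasible covers of l: one stops at {j, k}, the other absorbs k.
ep["l"] = estimated_power("l", [("j", "k"), ("j", "d", "e")], d, ep)
```

In this toy case the cover that absorbs node k is cheaper because it avoids paying for the output transitions of k and for a separate LUT rooted at k.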
For example, for the mapping solution shown in Figure 3.3,

$$EP(j) = K_p C_{out}\, d(j) + K_p C_{in} [d(a) + d(b) + d(c)] + EP(a) + EP(b) + EP(c) = K_p C_{out}\, d(j) + K_p C_{in} [d(a) + d(b) + d(c)]$$

(as $EP(a) = EP(b) = EP(c) = 0$), and

$$EP(l) = K_p C_{out}\, d(l) + K_p C_{in} [d(j) + d(k)] + EP(j) + EP(k)$$

Figure 3.3. A Mapped Network into 3-LUTs Rooted at l (Nodes a, b, c, d and e are PI nodes).

3.4 Power Minimization Algorithm

In this section, we first present our power minimization algorithm, called PowerMinMap (PMM), and then discuss its extension, named PowerMinMap-d (PMM-d), which optimizes both power and delay simultaneously. Further, we will show that PMM-d yields delay-optimal solutions.

The input to the algorithm is a K-bounded gate-level network. If the netlist is not K-bounded, we first decompose the network to satisfy the K-bounded property. The equilibrium probabilities and transition densities of the PIs are given. The PMM algorithm generates a power-optimized mapping solution and consists of the following three phases:

1. Transition density propagation.
2. Computation of $EP(v)$ for every node $v$.
3. Mapping generation.

In the first phase, we compute the transition density for all internal nodes as well as PO nodes in topological order. Note that by doing so, we also propagate the equilibrium probabilities. Below, we describe the details of phases two and three.

3.4.1 Phase II: Computation of EP(v)

The EP values of all nodes are computed in topological order starting from the PIs. For each node $v$, a sequence of T-bounded K-feasible cuts is computed based on the EP values of $v$'s ancestors. $EP(v)$ is computed based on the cut that yields the minimum power consumption.
A T-bounded K-feasible cut is a cut whose input size is no larger than $K$ such that the EP value on any input to the cut is no more than $T$. This term is explained in detail below.

3.4.1.1 T-Bounded K-Feasible Cut

While generating a LUT rooted at node $v$, we would like to minimize the sum of the power consumption of the fanins feeding into this LUT. However, it has been proved in [33] that the problem of finding a technology mapping solution with minimum power consumption is NP-hard. We propose an efficient method to compute a low-power K-feasible cut by computing a sequence of T-bounded K-feasible cuts using incremental network flow computation.

Definition 3. A T-bounded K-feasible cut $(X_v, \bar{X}_v)$ for node $v$ is a K-feasible cut such that the maximum EP value of the nodes in $X_v$ that provide inputs to the nodes in $\bar{X}_v$ is no larger than a threshold $T$.

Definition 4. A T-bounded K-feasible cut is a low-power K-feasible cut for node $v$ if it results in the smallest power consumption among the sequence of T-bounded K-feasible cuts computed for $v$.

For a given non-PI node $v$, a low-power K-feasible cut is computed as follows. We start with the largest EP value among $v$'s ancestors as the threshold $T$ and use a network-flow-based approach to check whether there exists a T-bounded K-feasible cut. We convert subnetwork $N_v$ into a node-capacitated flow network by assigning infinite flow capacity to those nodes having EP values larger than the threshold and unit flow capacity to all other nodes. Then, we compute a max-flow in the constructed network. If the value of the max-flow is no larger than $K$, the corresponding min-cut is a T-bounded K-feasible cut, and we repeat with the next largest EP value among $v$'s ancestors as the new threshold value, and so on. Each time we find a new K-feasible cut, we compute its total power consumption using Equation (3.6). We retain the cut that yields the smallest power consumption so far.
When the value of the max-flow exceeds $K$, we terminate, and the estimated power for node $v$, $EP(v)$, is set to the smallest power consumption recorded. This is because we will not be able to find a K-feasible cut for any smaller threshold.

Illustrative Example:

Figure 3.4. (a) Selecting a 3-feasible Cut for Node v. (b) The Mapping Solution. Node labels in (a), in d/EP format: a, b, c, d = 1/0; e = 1.5/3.5; f = 2/4; g = 1/3; h = 2.2/5.2.

Figure 3.4(a) illustrates how a sequence of T-bounded K-feasible cuts is determined for node $v$ for $K = 3$. Next to each node (except $v$), the transition density ($d$) and EP value are shown in d/EP format; these have already been computed before processing node $v$. For simplicity, we assume that $K_p C_{out}$ and $K_p C_{in}$ are equal to 1. The threshold value is first set to 5.2, which is the largest EP value (for node $h$) of the nodes in the set $N_v - \{v\}$. The corresponding min-cut computed is Cut I. According to Equation (3.6), the estimated power for Cut I is:

$$EP_{Cut\,I}(v) = [d(v) + d(g) + d(h)] + EP(g) + EP(h) = d(v) + 11.4$$

We store Cut I as the best cut at this point. Then threshold $T$ is lowered to 4, the next highest EP value (for node $f$) in the set $N_v - \{v\}$, and Cut II is the min-cut computed. The estimated power for Cut II is:

$$EP_{Cut\,II}(v) = [d(v) + d(e) + d(f) + d(g)] + EP(e) + EP(f) + EP(g) = d(v) + 15$$

Since $d(v) + 15$ is larger than $d(v) + 11.4$, Cut I remains the best cut and we move on to the next smaller EP value as the threshold. The threshold is lowered to 3.5 (for node $e$) and Cut III is the min-cut computed. Since its cut size is larger than 3, we terminate.
Hence, the mapping induced by Cut I is the most desirable choice for node $v$. Figure 3.4(b) shows the corresponding mapping solution to cover $N_v$.

3.4.1.2 Incremental Network Flow Computation

In this section, we describe how the incremental network flow method can be used to efficiently find the low-power K-feasible cut for a node $v$. We first construct the flow network with the largest EP value in $N_v$ as the threshold. Once we have found the first T-bounded K-feasible cut, we have to repeat with the next largest EP value as the threshold. However, there is no need to build a new flow network to compute the new cut. We only need to update the residual network for all nodes $u$ such that $EP(u)$ equals the old threshold value.

We use the same example as in Section 3.4.1.1. Suppose we want to compute a low-power 3-feasible cut for node $v$ in Figure 3.4(a). The EP values of the ancestors of node $v$ in decreasing order are 5.2, 4, 3.5, 3, and 0. First, the threshold is set to 5.2, so $\bar{X}_v$ can have any node as its input node. The required flow network is shown in Figure 3.5(a). Here we use a standard network transformation technique, known as node-splitting, that transforms a node-capacitated network into an edge-capacitated network so that any existing edge cut computation algorithm can be applied. A minimum cut of size 2 is computed by maximum flow computation; the residual network is shown in Figure 3.5(b). This cut corresponds to Cut I in Figure 3.4(a).
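The threshold sweep and the node-splitting transformation can be sketched as follows. This is a simplified, non-incremental sketch: it rebuilds the flow network for every threshold instead of updating the residual network, and it finds augmenting paths by plain breadth-first search. The function names are illustrative assumptions. The d/EP values are those of Figure 3.4(a), but the wiring of the network and the value of d(v) are hypothetical, chosen only to be consistent with those labels.

```python
from collections import deque

INF = float("inf")


def cone_nodes(fanins, root):
    """Return the node set of N_root: root plus all of its predecessors."""
    seen, stack = set(), [root]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(fanins.get(u, []))
    return seen


def t_bounded_cut(fanins, root, ep, threshold, k):
    """Node cut set of a T-bounded K-feasible cut, or None if none exists."""
    cap = {}  # residual capacities of the node-split network

    def add_edge(a, b, c):
        cap.setdefault(a, {})[b] = c
        cap.setdefault(b, {}).setdefault(a, 0)

    nodes = cone_nodes(fanins, root)
    for u in nodes - {root}:
        # Node splitting: u becomes the edge (u, "in") -> (u, "out").
        add_edge((u, "in"), (u, "out"), 1 if ep[u] <= threshold else INF)
        if not fanins.get(u):  # PI node: attach it to the source
            add_edge("s", (u, "in"), INF)
    for w in nodes:
        for u in fanins.get(w, []):
            add_edge((u, "out"), "t" if w == root else (w, "in"), INF)

    flow = 0
    while True:
        parent = {"s": None}  # BFS tree in the residual network
        queue = deque(["s"])
        while queue and "t" not in parent:
            a = queue.popleft()
            for b, c in cap[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if "t" not in parent:
            break  # no augmenting path left: the flow is maximum
        flow += 1
        if flow > k:
            return None  # the cut size would exceed K
        b = "t"
        while parent[b] is not None:  # push one unit along the path
            a = parent[b]
            cap[a][b] -= 1
            cap[b][a] += 1
            b = a
    # Cut nodes: input half reachable from s, output half not reachable.
    return sorted(u for u in nodes - {root}
                  if (u, "in") in parent and (u, "out") not in parent)


def low_power_cut(fanins, root, ep, d, k):
    """Sweep thresholds in decreasing EP order and keep the cheapest cut."""
    best = None
    thresholds = {ep[u] for u in cone_nodes(fanins, root) - {root}}
    for t in sorted(thresholds, reverse=True):
        cut = t_bounded_cut(fanins, root, ep, t, k)
        if cut is None:
            break
        # Equation (3.6) with K_p*C_out = K_p*C_in = 1.
        power = d[root] + sum(d[u] + ep[u] for u in cut)
        if best is None or power < best[0]:
            best = (power, cut)
    return best


# Hypothetical wiring consistent with the d/EP labels of Figure 3.4(a).
fanins = {"e": ["a", "b"], "f": ["b", "c"], "g": ["c", "d"],
          "h": ["e", "f"], "v": ["g", "h"]}
d = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.5, "f": 2.0,
     "g": 1.0, "h": 2.2, "v": 1.0}  # d(v) = 1.0 is an assumed value
ep = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 3.5, "f": 4.0,
      "g": 3.0, "h": 5.2}
best_power, best_cut = low_power_cut(fanins, "v", ep, d, 3)
```

On this network the sweep visits thresholds 5.2 and 4, obtains the cut sets {g, h} and {e, f, g}, and stops at threshold 3.5, mirroring Cuts I, II and III of the example.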
Since Cut I is 3-feasible, we lower the threshold to 4 to compute a new cut $(X_v, \bar{X}_v)$ such that no node with EP value greater than 4 can be an input node to $\bar{X}_v$. To compute such a new cut, we may simply update the residual network by changing the flow capacity of node $h$, whose EP value is 5.2, from unit to infinity. If there exists a new augmenting path in the updated residual network, it can be found efficiently. We associate each node $v$ with two flags: $FS(v)$ equal to TRUE indicates that $v$ is reachable from source $s$, and $FT(v)$ equal to TRUE indicates that there exists a path from $v$ to sink $t$ in the residual network. Note that the initial values of $FS(v)$ and $FT(v)$ can be derived simultaneously while traversing the network to compute the min-cut of the original network. When we update the flow capacity of a node $v$, we also check whether its flags have to be updated. For example, $FS(h)$ is TRUE and $FT(h)$ is FALSE in Figure 3.5(b). After node $h$'s flow capacity is updated, $FT(h)$ needs to be updated to TRUE, as in Figure 3.5(c). If $FS(v)$ and $FT(v)$ are both equal to TRUE, we know that there exists an augmenting path. In this case, we augment the flow and traverse the network again to compute a new min-cut and the new FS and FT values for each node. Since we terminate once the max-flow value exceeds $K$, we traverse the network at most $K$ times.

3.4.2 Phase III: Mapping Generation

Phase III is the mapping generation phase, which generates a K-LUT mapping solution of the whole circuit. The mapping solution is generated in a bottom-up manner starting from the PO nodes. If a LUT needs to be generated rooted at $v$, a K-feasible cut is computed for node $v$ based on the cost values of its ancestors.
Initially, $cost(u)$ is set to $EP(u)$ for all nodes $u$. When we generate a LUT rooted at node $v$, the power dissipation of subnetwork $N_u$ is counted for every node $u$ that provides input to $LUT_v$. To avoid counting the power dissipation of subnetwork $N_u$ again if we generate another LUT that also receives input from node $u$, we reset $cost(u)$ to 0 after the first time it is counted. When we compute the low-power K-feasible cut in this phase, we use the dynamically updated cost values.

Figure 3.5. (a) Initial Flow Network (for T = 5.2). (b) Residual Network after the Max-flow is Computed. (c) The Updated Residual Network with a New Augmenting Path Shown in Bold Edges.

We also introduce a technique to further improve the quality of the mapping. If a low-power K-feasible cut found by the flow method has a size less than $K$, we check whether it is possible to reduce the sum of the power consumption of the fanins by replacing one node in the node cut set with its fanins such that the cut size still does not exceed $K$. We call this cut frontier refinement. For example, assume Cut 1 shown in Figure 3.6 is a 4-feasible cut computed for node $v$. Since the cut size is only 3, we can increase the cut size to 4 and the cut is still 4-feasible. If $cost(l) + cost(m) < cost(i)$, we know that a LUT generated according to Cut 2 consumes less power. Similarly, we also check whether choosing the node cut set $\{i, m, n, k\}$ or $\{i, j, n, o\}$ saves more power than Cut 2. We do this recursively until we cannot increase the cut size any more.
If there exist cuts that use less power than the K-feasible cut computed, we replace it with the one that reduces the power the most.

Figure 3.6. Cut Frontier Refinement for Network Rooted at v (Assuming K = 4).

The complete PowerMinMap algorithm is shown in Figure 3.7. Lines 1-4 form the first phase, which computes the transition density for each non-PI node in the DAG. Lines 5-9 form the second phase, which computes $EP(v)$, the estimated minimum power consumption of a LUT-subnetwork covering $N_v$, for every node $v$. Lines 10-21 form the last phase of the PMM algorithm, in which the mapping solution is generated. Once a LUT is generated, we perform cut frontier refinement to further reduce power and update the cost values of its fanins.

Algorithm PowerMinMap
Input: General Boolean network N
Output: A mapping of the network into K-LUTs
/* phase 1 */
1   L ← list of non-PI nodes in topological order from PIs to POs;
2   for each node v ∈ L
3       compute d(v);
4   endfor
/* phase 2 */
5   for each node v ∈ L
6       compute a sequence of T-bounded K-feasible cuts for node v;
7       EP(v) ← power consumption corresponding to the low-power K-feasible cut;
8       cost(v) ← EP(v);
9   endfor
/* phase 3 */
10  Q ← a queue of all POs;
11  while Q ≠ ∅
12      v ← dequeue(Q);
13      compute a low-power K-feasible cut for node v;
14      perform cut frontier refinement;
15      generate a LUT rooted at v;
16      for each node u ∈ inputs(LUT_v)
17          if node u ∉ Q and node u ∉ PIs then
18              enqueue(u, Q);
19          cost(u) ← 0;
20      endfor
21  endwhile
End algorithm

Figure 3.7. Pseudo code of PowerMinMap Algorithm.

3.4.3 Computational Complexity of PowerMinMap

Theorem 1. A low-power K-feasible cut in $N_v$ can be computed in $O(Km' + n' \log n')$ time, where $n'$ and $m'$ are the number of nodes and edges in $N_v$, respectively.

Proof 1.
It takes $O(n' \log n')$ time to sort the EP values of the nodes in $N_v - \{v\}$. If we use an incremental network flow computation approach to compute a set of T-bounded K-feasible cuts to find the low-power K-feasible cut in $N_v$, we need to traverse the network at most $K$ times. Traversing the network once takes $O(m' + n')$ time. Hence, the total time needed to compute a low-power K-feasible cut in $N_v$ is $O(n' \log n') + O(Km' + Kn')$, which is $O(Km' + n' \log n')$. When $m'$ and $n'$ are of the same order and $K$ is fixed, the complexity becomes $O(n' \log n')$.

Theorem 2. Given a general Boolean network $N$, PowerMinMap generates a low-power mapping solution in $O(Kmn + n^2 \log n)$ time, where $n$ and $m$ are the number of nodes and edges in $N$, respectively.

Proof 2. In phase one of the PMM algorithm, it takes a total of $O(n)$ time to compute the output transition density $d(v)$ for all non-PI nodes $v$. According to Theorem 1, calculating the low-power K-feasible cut for a subnetwork $N_v$ takes $O(Km' + n' \log n')$ time, where $n'$ and $m'$ are the number of nodes and edges in $N_v$, respectively. Since $m'$ and $n'$ are bounded by $m$ and $n$, phase two takes up to $O(n(Km + n \log n)) = O(Kmn + n^2 \log n)$ time to compute $EP(v)$ and $cost(v)$ for all nodes $v$. Phase three differs from phase two in that the low-power K-feasible cut is recomputed for $L$ nodes, where $L$ is the number of LUTs generated in the mapping solution ($L \le n$), and cut frontier refinement is also performed. The time needed for cut frontier refinement is $O(K)$. So phase three takes $O(n(Km + n \log n + K)) = O(Kmn + n^2 \log n)$ time.
Thus, the overall runtime of the PMM algorithm is $O(n) + O(Kmn + n^2 \log n) + O(Kmn + n^2 \log n)$, which is $O(Kmn + n^2 \log n)$. If $m$ and $n$ are of the same order and $K$ is fixed, the total runtime is $O(n^2 \log n)$.

3.5 PowerMinMap-d: Simultaneous Power and Delay Optimization

In this section, we continue to study the technology mapping problem for minimizing power consumption, with optimal depth as the priority. We first review the Flowmap algorithm [39], which is known to be a delay-optimal LUT-based technology mapper. Then we present an extension of the PowerMinMap algorithm that achieves optimal delay as well as power savings. This extension, named PowerMinMap-d (PMM-d), is efficient and effective in computing a low-power delay-optimal mapping solution.

3.5.1 Review of Flowmap Algorithm

Flowmap [39] is a LUT-based FPGA technology mapping algorithm that generates delay-optimal K-LUT mapping solutions. Given a K-bounded Boolean network, let $N_v$ denote the subnetwork consisting of node $v$ and all its predecessors. The label of node $v$, denoted by $label(v)$, is the optimal depth of a K-LUT mapping of $N_v$. For any PI node $v$, $label(v)$ is zero. In the first phase of Flowmap, the nodes are processed in topological order from PI nodes to PO nodes and the labels of all nodes are computed. Assume $L$ is the maximum label of the predecessors of node $v$. If node $u$ is a fanin of $v$, we define collapse as the operation of removing $u$ from $N_v$ and replacing every edge $(x, u) \in E(N_v)$ by $(x, v)$. To compute $label(v)$, Flowmap first collapses all nodes $u$ with $label(u) = L$ into $v$ and computes a min-cut in $N_v$. If the cut size is no larger than $K$, Flowmap sets $label(v)$ to $L$ and stores the cut. Otherwise, Flowmap sets $label(v)$ to $L + 1$ and stores the cut $(N_v - \{v\}, \{v\})$.
Note that $(N_v - \{v\}, \{v\})$ is guaranteed to be a K-feasible cut because the original network is K-bounded. In either case, the cut computed is called a min-height K-feasible cut for node $v$.

In the second phase, Flowmap generates the mapping solution. Let $(X_v, \bar{X}_v)$ be the cut stored for node $v$ in the first phase, where $v \in \bar{X}_v$. $Q$ is a queue that initially contains all PO nodes. The following operation is repeated until $Q$ contains only PI nodes: remove the head node $v$ from $Q$ and generate a LUT to cover the nodes in $\bar{X}_v$ according to the min-height K-feasible cut $(X_v, \bar{X}_v)$ computed for node $v$; then insert into $Q$ the nodes that provide inputs to the LUT just generated, if they have not been inserted yet. It has been shown that Flowmap is guaranteed to generate depth-optimal K-LUT mapping solutions for a given K-bounded network $N$.

3.5.2 PowerMinMap-d Algorithm

The PMM-d algorithm consists of two phases. The first phase of PMM-d is similar to that of Flowmap. For each non-PI node, we compute its label and store its min-height K-feasible cut. In addition, for each node $v$, we also compute its output transition density $d(v)$ and estimate the power consumption of a K-LUT mapping of subnetwork $N_v$ rooted at $v$. Unlike in the PMM algorithm, we estimate $EP(v)$ in PMM-d according to the min-height K-feasible cut computed.

The second phase of PMM-d differs from that of Flowmap: before generating a K-LUT rooted at $v$, we determine whether the depth of this LUT can be relaxed from the value $label(v)$ without increasing the depth of the overall mapping solution. A similar idea was also used by Cong and Hwang in [45] for area minimization with optimal depth. If we can relax its depth, we compute a low-power K-feasible cut for $v$ and generate a K-LUT accordingly. Otherwise, we generate a K-LUT using the min-height K-feasible cut stored in the first phase.
We can determine whether the depth of a LUT rooted at node $v$ can be relaxed by computing its slack value:

$$slack(v) = D_{opt} - label(v) - D_v \tag{3.7}$$

where $D_{opt}$ is the optimal depth of a K-LUT mapping solution of the entire network, and $D_v$ is the maximum number of LUTs on a path from some child of node $v$ to some PO node.

Figure 3.8. Labels Computed for a Boolean Network Assuming K = 3.

Note that $D_{opt}$ is known after phase one, since $D_{opt} = \max \{ label(u) : u \in \text{set of PO nodes} \}$, but $D_v$ has to be determined dynamically when we generate the actual mapping in a bottom-up manner. When we are generating a LUT rooted at node $v$ in phase two, a partial mapping solution covering all the successors of $v$ has already been generated. Thus, $D_v$ is also known by this time. If $slack(v) > 0$, then the depth of a LUT rooted at $v$ can be relaxed and node $v$ is called a noncritical node. Otherwise, the depth of a LUT rooted at $v$ cannot be relaxed and such a node is called a critical node.

For example, consider computing the slack values of nodes $u$ and $v$ in Figure 3.8. The optimal depth $D_{opt}$ of the network is 4, $label(u) = 3$, $label(v) = 2$, and $D_u = D_v = 1$. According to Equation (3.7), $slack(u) = 4 - 3 - 1 = 0$ and $slack(v) = 4 - 2 - 1 = 1$. So node $u$ is a critical node but node $v$ is a noncritical node. Since node $v$ is noncritical, our algorithm has the flexibility to generate a different mapping for $N_v$ if power savings can be achieved. The mapping solution generated by Flowmap for $N_v$ is shown in Figure 3.9(a). On the other hand, PMM-d may generate the mapping solution shown in Figure 3.9(b) to reduce power consumption.
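The slack test of Equation (3.7) amounts to a few integer operations per node. The sketch below is only an illustration; the function names are assumptions, and the numbers are those of the example for nodes u and v in Figure 3.8.

```python
def optimal_depth(label, primary_outputs):
    """D_opt: the largest label among the PO nodes (known after phase one)."""
    return max(label[u] for u in primary_outputs)


def slack(d_opt, label_v, d_v):
    """Equation (3.7): slack(v) = D_opt - label(v) - D_v."""
    return d_opt - label_v - d_v


def is_critical(d_opt, label_v, d_v):
    """A node is critical when its LUT depth cannot be relaxed."""
    return slack(d_opt, label_v, d_v) <= 0


# Values quoted in the text: D_opt = 4, label(u) = 3, label(v) = 2,
# and D_u = D_v = 1.
slack_u = slack(4, 3, 1)  # 0: node u is critical
slack_v = slack(4, 2, 1)  # 1: node v is noncritical
```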
Note that the depth of node $v$ is increased by one, but this does not increase the optimal depth of the entire network since node $v$ was noncritical.

Our algorithm is guaranteed to generate optimal-delay mapping solutions; the explanation is as follows. When we are generating a LUT rooted at a noncritical node $v$ with $label(v) = L$, we know that its predecessors have slack values no less than $slack(v)$. Once a low-power K-feasible cut is computed, if a node $u$ in the cut set has $label(u) = L$, we know that $slack(u)$ will be decreased by one. So after $LUT_v$ is generated, the slack value of each node $u \in inputs(LUT_v)$ will not become negative. If $slack(u)$ becomes 0, node $u$ becomes critical, and we then use the min-height K-feasible cut to cover node $u$. This justifies the claim that our algorithm always generates delay-optimal mapping solutions.

The complete PMM-d algorithm is shown in Figure 3.10. Lines 1-10 account for the first phase, where $d(v)$, $label(v)$ and $EP(v)$ are computed for each non-PI node $v$ in topological order. Lines 11-16 check whether node $v$ is critical and make sure that a LUT will be generated to cover each PO node. Lines 17-33 generate the final mapping. For a critical node, the min-height K-feasible cut computed in the first phase is used to generate the mapping. For a noncritical node, we compute a low-power K-feasible cut and generate a LUT accordingly. Then, we update the slack values and the cost values of the nodes feeding into this LUT.

Figure 3.9. Different Mappings Assuming K = 3: (a) Using Flowmap and (b) Using PMM-d.

3.5.3 Computational Complexity of PowerMinMap-d

Theorem 3.
Given a general Boolean network $N$, PowerMinMap-d generates a depth-optimal low-power mapping solution in $O(Kmn + n^2 \log n)$ time, where $n$ and $m$ are the number of nodes and edges in $N$, respectively.

Proof 3. According to [39], the labeling phase for all nodes in $N$ can be done in $O(Kmn)$ time. Besides, it takes $O(1)$ time to compute the output transition density $d(v)$ and $EP(v)$ for each non-PI node $v$. So the time needed in the first phase of PowerMinMap-d is $O(Kmn)$. In the second phase, the number of LUTs generated is bounded by $n$. If $v$ is a critical node, it takes $O(1)$ time to generate $LUT_v$ using the min-height K-feasible cut computed in phase one. Otherwise, it takes $O(Km' + n' \log n')$ time to compute a low-power K-feasible cut to generate $LUT_v$, where $m'$ and $n'$ are the number of edges and nodes in $N_v$, respectively. Therefore, phase two takes $n_1 \cdot O(1) + n_2 \cdot O(Km + n \log n) < O(Kmn + n^2 \log n)$ time, where $n_1$ is the number of LUTs generated for covering critical nodes and $n_2$ is the number of LUTs generated for covering noncritical nodes ($n_1 + n_2 < n$). The total time needed by both phases is $O(Kmn) + O(Kmn + n^2 \log n)$, so the overall computational complexity of PowerMinMap-d is $O(Kmn + n^2 \log n)$. When $m$ and $n$ are of the same order and $K$ is fixed, the total time complexity becomes $O(n^2 \log n)$.

Algorithm PowerMinMap-d
Input: General Boolean network N
Output: A mapping of the network into K-LUTs
/* phase 1 */
1   L ← list of nodes in topological order from PIs to POs;
2   for each node v ∈ L
3       if v ∈ PIs then label(v) ← 0;
4       else
5           compute the min-height K-feasible cut for node v and label(v);
6           compute d(v);
7           EP(v) ← power estimation for the computed cut;
8           cost(v) ← EP(v);
9   endfor
10  D ← max{label(v) : v ∈ POs}
/* phase 2 */
11  for each node v ∉ PIs
12      D_v = 0;
13      if node v ∈ POs then
14          lut(v) ← TRUE; slack(v) ← D − label(v);
15      else lut(v) ← FALSE; slack(v) ← 0;
16  endfor
17  Q ← a queue of POs;
18  while Q ≠ ∅
19      v ← dequeue(Q);
20      if lut(v) = TRUE
21          if slack(v) = 0 then
22              generate LUT_v using the min-height K-feasible cut for node v computed;
23          else
24              compute a low-power K-feasible cut for node v;
25              generate LUT_v according to the low-power K-feasible cut computed;
26          for each node u ∈ inputs(LUT_v) and u ∉ PIs and u ∉ Q
27              lut(u) ← TRUE;
28              D_u ← max(D_u, D_v + 1);
29              slack(u) ← D − label(u) − D_u;
30              cost(u) ← 0;
31              enqueue(u, Q);
32          endfor
33  endwhile
End algorithm

Figure 3.10. Pseudo code of PowerMinMap-d Algorithm.

3.6 Experimental Results of Low Power Technology Mapping Algorithms

We have implemented both the PMM and PMM-d algorithms in C and experimented with a set of MCNC benchmark circuits. Given a general Boolean network, we first optimize it by running the general optimization script "rugged" within SIS. Then we decompose the circuit into a 2-bounded network using the "DMIG" command. The mapping solution is computed on the decomposed network. We assume that $V_{dd} = 5$ V and that all PI nodes have equilibrium probability $p = 0.5$ and transition density $d = 10{,}000$. The capacitances $C_{in}$ and $C_{out}$ are set to 10 pF each. When we estimate the power consumption of a mapping solution, the external loading capacitance of PO nodes (which is not known a priori) is ignored, since the power consumption due to a given external load is independent of the mapping. Note that the above assumptions have been used in [1], [2].

For comparison with the results reported in [1] and [2], we run our PMM algorithm for mapping into 5-input LUTs. The results of the PMM algorithm are shown in Table 3.1 in terms of power consumption and the number of LUTs for each circuit.
For comparison, we quote the best result reported for each circuit in [1]. We also include the results from [2] because we notice that in some cases [2] outperforms [1]. On average, our algorithm reduces the power consumption by 18.5% and 12.2% compared with [1] and [2], respectively. Besides, it uses 9.5% and 10.6% fewer LUTs than [1] and [2], respectively.

Table 3.1. Comparison of PowerMinMap with [1] and [2] (PWR: mW).

             Wang et al. [1]  Farrahi et al. [2]                 PMM   Impr. (%) vs. [1]   Impr. (%) vs. [2]
CKT           LUTs       PWR      LUTs       PWR      LUTs       PWR      LUTs       PWR      LUTs       PWR
5xp1            26       188        25       182        23       168      11.5      10.6       7.7       7.8
9sym            87       553        62       365        60       340        31      38.5       3.2       6.8
9symml          62       446        58       376        56       349       9.7      21.7       3.4       7.2
c499            98      1061        91      1076        74       983      24.5       7.4      18.7       8.6
c880           116       746       111      1060       106       718       8.6       3.8       4.5      32.2
alu2           128       874       146       836       124       815       3.1       6.8      15.1       2.5
apex6          192      1110       237      1404       183      1012       4.7       8.8      22.8      27.9
apex7           68       379        69       450        65       349       4.4       7.9       5.8      24.0
count           31       159        31       227        31       159         0         0         0      29.9
duke2          152       538       190       478       145       443       4.6      17.6      23.7       7.3
misex1          13        97        16       106        12        92       2.3       5.2        25      13.2
rd84            33       261        27       344        32       243         3       6.9     -18.5      29.4
rot            234      1306       238      1749       224      1203       4.3       7.9       5.9      31.2
vg2             32       182        25        60        26       158      18.7      13.2        -4       1.3
z4ml             7        56         5        80         5        41      28.6      26.8      28.6      48.8
Average                                                                   10.6      12.2       9.5      18.5

We note that the heuristic enumeration method of [1] does not guarantee that enumerating more cuts yields a better mapping solution in terms of power consumption. One drawback of [2] is that it fails to take into account the impact of the fanout number of LUTs on the power consumption. It tries to generate LUTs rooted at nodes with smaller transition density and maximizes the number of inputs to each LUT. In our experiments, we found that in some cases this approach resulted in larger total power consumption.
Our algorithm improves on this and generates better mapping solutions because it directly considers the impact of generating a single LUT on the mapping of the entire network.

For our PMM-d algorithm, we compared it with Cutmap [45]. Cutmap is an extension of Flowmap that simultaneously minimizes the area and depth. In general, Cutmap also produces solutions with smaller power consumption than Flowmap because it minimizes the area. We have run Cutmap within SIS and estimated the power consumption on the circuits mapped into 4-LUTs. The results are shown in Table 3.2 with respect to power consumption, area, and depth. The results generated by our algorithm have the same depths as those of Cutmap, which are guaranteed to be optimal. On average, PMM-d reduces the power consumption by 14.1% at the expense of using 9.2% more LUTs than Cutmap. We notice that for two small benchmark circuits (5xp1 and rd84), our algorithm performed slightly worse than Cutmap in terms of power consumption, but it outperformed Cutmap in power savings in all other cases.

Table 3.2. Comparison of PowerMinMap-d and Cutmap (Power: mW).

                              Cutmap           PowerMinMap-d
Circuit       4-LUTs   Power   depth  4-LUTs   Power   depth
5xp1              46     333       4      52     336       4
9sym              69     471       5      72     420       5
9symml            92     720       6     108     659       6
alu2             145    1404      13     146     936      13
alu4             326    3202      15     353    2110      15
c1355             74    2015       4      74    1463       4
c2670            280    2883       8     315    2527       8
c432             109    1114      10     113     951      10
c499              66    1836       4      74    1268       4
c5315            744    6954       9     831    6274       9
c7552            891    9592       8     984    7708       8
c880             132     968      10     144     912      10
dalu             683    4097      12     729    3761      12
duke2            185     958       8     195     719       8
i8              1383    9853       6    1647    8615       6
k2              1243    5358       6    1311    4368       6
rd84              53     425       9      67     450       9
rot              359    2450       8     362    2252       8
vg2               51     292       4      60     286       4
x1               207    1081       4     218    1074       4
Comparison         1       1       1   1.092   0.859       1

Previous works on minimizing dynamic power consumption did not handle the variation of transition density of the PI nodes.
In practice, the PI nodes do not necessarily have uniform equilibrium probabilities. So, we also ran experiments using randomly generated equilibrium activities for the PI nodes. We include two sets of experimental results compared with Cutmap, as shown in Table 3.3. On average, the power savings and area overhead of PMM-d are similar to the results shown in Table 3.2 (about 13% to 15% less power with about 9% to 10% more LUTs). Note that the mappings generated by Cutmap do not change, while our algorithm generates different mappings according to the changing transition activities of the nodes in the network.

Table 3.3. Comparison of PMM-d and Cutmap with Randomly Generated Transition Densities for PI Nodes.

                           Experiment #1               Experiment #2
                        CM         PMM-d            CM         PMM-d
Circuit        LUTs    PWR   LUTs    PWR   LUTs    PWR   LUTs    PWR
5xp1             46    308     50    291     46    382     52    401
9sym             69    515     73    493     69    491     72    459
9symml           92    773     99    657     92    715    108    687
alu2            145   1438    149   1225    145   1350    148   1087
alu4            326   3381    350   2597    326   3422    348   2780
c1355            74   2132     75   1657     74   1985     74   1481
c2670           280   2912    313   2590    280   2685    308   2352
c432            109    983     79    557    109    625     78    587
c499             66   1242     80   1129     66   1422     81   1326
c5315           744   7275    830   6920    744   7416    821   6639
c7552           891  10428    980   8217    891   9964    975   8036
c880            132   1107    142    981    132    973    145    908
dalu            683   4251    726   3882    683   4129    713   3716
duke2           185   1057    193    877    185    972    195    816
i8             1383   9712   1588   8581   1383  10244   1639   9211
k2             1243   5409   1308   4872   1243   6140   1327   5327
rd84             53    478     65    490     53    510     64    524
rot             359   2792    362   2616    359   2507    263   2401
vg2              51    278     58    263     51    271     59    212
x1              207   1125    216   1082    207   1217    220   1202
Comparison        1      1  1.097  0.871      1      1  1.091  0.853

3.7 Conclusions on Low Power Technology Mapping

In this chapter, we studied the technology mapping problem to minimize power dissipation for LUT-based FPGAs.
We presented an algorithm that significantly reduces power as well as area. It uses an efficient incremental network flow computation approach to compute low-power K-feasible cuts that minimize the total power consumption of the generated K-LUTs. Through our experimentation, we notice that minimizing only the switching activity of nodes at the output of LUTs is not good enough. Hence our algorithm estimates the power consumption for a set of possible choices and chooses the one that yields the best result. Experimental results show that our algorithm achieved both power and area savings over two previous power minimization mapping algorithms. An extension of our work is also implemented to compute delay-optimal low-power mapping solutions. Compared with Cutmap, a delay-optimal mapper with simultaneous area minimization, our algorithm reduces power consumption without any delay penalty.

CHAPTER 4
PERFORMANCE-DRIVEN FORCE-DIRECTED PLACEMENT ALGORITHM FOR HIERARCHICAL FPGAS

With microprocessors of over a billion transistors around the corner, logic capacity and complexity have become extremely high. The technology feature size is in the deep submicron (DSM) regime; e.g., the most advanced microprocessors from Intel and AMD all use 0.13 micron technology. All these factors make physical design more important and more difficult. As one of the phases in physical design, placement plays a crucial role, as it indirectly determines the on-chip interconnects, which nowadays dominate the overall system delay and hence are the bottleneck in enhancing system performance. In this chapter, we study the performance-driven placement problem. We propose a net-based force-directed placement algorithm targeting hierarchical FPGAs. This method also works well for flat FPGA architectures. The input netlist is first transformed into a net dependency graph.
Then we partition this graph into clusters, and a net-cluster level floorplan is derived by simulated annealing to optimize the wirelength. Force-directed net-level placement is performed to generate a coarse net placement. Next, a force-directed scheme is developed to compute the logic cell placement iteratively, where the forces on nets determine the positions of the cells. Finally, we assign I/O pins using a fast minimum-weight bipartite matching algorithm. The main contribution of our work is that we develop a top-down design flow and apply the force-directed method to hierarchical FPGAs to achieve satisfying performance as compared to previous CAD tools. Compared with the Xilinx Foundation tools, the experimental results show that our algorithm improves the post-layout delay (longest path delay) and the average connection delay by an average of 10.2% and 19.3%, respectively, over a set of MCNC combinational benchmarks. We also improve the maximum clock frequency by an average of 20.7% over a set of MCNC sequential circuits. We also compare our proposed algorithm with VPR [87], a state-of-the-art place and route suite for academic research, and achieve improvements in total netlength as well as critical path delay over the same set of benchmarks.

This chapter is organized as follows: Section 4.1 introduces several previously proposed placement algorithms for performance enhancement. In Section 4.2, we show the hierarchical FPGA architecture which we adopt in our research. In Section 4.3, we present the details of our proposed performance-driven force-directed placement approach. Experimental results are given in Section 4.4. We conclude in Section 4.5.
4.1 Introduction

Field-programmable gate arrays (FPGAs) are widely used for VLSI system design and rapid system prototyping due to their fast time-to-market and programmability. Recently, FPGA manufacturers have introduced hierarchical FPGAs such as Xilinx's Virtex family [10] and Altera's Apex family [93]. To use these million-gate devices more efficiently, computer-aided design for FPGAs has become very important. Placement has become one of the most persistent challenges in current digital system design, as designs often contain over a million placement blocks. Moreover, due to the dominance of interconnect delay, placement is a major factor in timing closure results [64]. Therefore, it is of great value to develop high-performance placement algorithms for FPGAs.

In recent years, many placement algorithms have been proposed to cope with the widely used objective of wirelength minimization. Among these algorithms, simulated annealing is one of the most well-developed placement methods. It is used in placement as an iterative improvement algorithm. Given an original placement configuration, a change to the configuration is made by moving a component, swapping the locations of two blocks, or changing the aspect ratio of one block. Moves resulting in decreases in cost are accepted. Cost can be wirelength, area, congestion, etc. Moves that result in an increase in cost are accepted with a probability which decreases over the iterations. This helps the search to jump out of local optima. TimberWolf by Swartz and Sechen [71] is a well-known simulated annealing based tool for timing-driven placement, but its target is standard-cell based devices. Another simulated annealing based tool, Versatile Place and Route (VPR) by Betz and Rose [87] and Marquardt, Betz, and Rose [94], is considered the best academic place and route suite for FPGAs.
VPR is known for generating very tight placements. However, VPR handles primarily non-hierarchical island-style FPGA architectures. PROXI [95] is a simulated annealing based timing-driven placement algorithm for FPGAs. It performs simultaneous placement and routing by ripping up and rerouting all disturbed nets after each perturbation. But it is very computationally expensive (it appears to be $O(n^3)$ based on its experimental results), which makes it infeasible for real designs.

Analytical placement is known for generating fast placement solutions using a quadratic wirelength objective function. Even though the quadratic objective is only an approximation of the wirelength, its main advantage is that it can be optimized very efficiently. Eisenmann and Johannes [85] presented a force-directed analytic placement technique which applies constant density-induced forces and uniformly distributes the cells. It works well in avoiding overlap but may unnecessarily result in a timing penalty for sparse designs. Most recently, Alupoaei and Katkoori [96] proposed a net-based force-directed macrocell placement approach to optimize wirelength, and their work showed remarkable improvement in wirelength. However, their work is limited to semi-custom ASIC devices.

Another commonly used placement approach is to partition the circuit into subcircuits such that the interconnect between blocks is minimized. Hutton, Adibsamii, and Leaver [74] presented a k-way partitioning based timing-driven placement algorithm. It handles hierarchical architecture but only targets Altera's APEX devices.

As part of this dissertation, we developed a novel force-directed performance-driven placement algorithm for hierarchical and flat FPGA devices. Details will be provided in Section 4.3.
4.2 Hierarchical Xilinx FPGA Architecture

In our work, we mainly target the hierarchical Xilinx Virtex series FPGA chip. The top level of a Virtex chip consists of a two-dimensional array of configurable logic blocks (CLBs), vertical and horizontal routing channels, and input/output blocks. Figure 4.1 illustrates the top level architecture of an array-based FPGA. The configurable logic blocks, denoted as CLB in Figure 4.1, are customizable to implement the logic functions. The connection blocks, denoted as C in Figure 4.1, connect the CLB pins to the routing channels. A horizontal routing channel and a vertical routing channel are connected via a switch box, denoted as S in Figure 4.1. A switch box comprises a number of programmable switches, which make the connections in the FPGA routing. Usually the switches have higher resistance and capacitance, and hence result in significant delays. The routing channels are segmented in order to balance circuit performance and routability. Routing tracks consist of a set of wires with different lengths, where longer wires are desirable for timing-critical nets and shorter wires are intended for short connections to save routing resources.

Figure 4.1. Top Level View of Xilinx Hierarchical FPGA.

Below, we describe the details of the Virtex v800fg680 chip, which is the main target device used in our experiments. The top level of this FPGA chip consists of an 84 by 56 array of CLBs. Each CLB consists of two slices, where a slice is the basic logic cell at the bottom level. From now on, we use the terms slice and cell interchangeably throughout this chapter.
In each slice, there are a pair of 4-input LUTs and a pair of D flip-flops, fast carry logic, a three-state driver, and control logic. Figure 4.2 shows the simplified CLB structure.

Figure 4.2. Simplified Architecture of a CLB.

Routing resources are available at both the top and the bottom levels. The local routing network within a CLB accounts for less delay than the interconnections between CLBs. On the periphery of the FPGA chip, there are 512 general-purpose I/O pads accessible to the users. In addition, there are four dedicated clock input pads, two at the top center and two at the bottom center. All I/O blocks are accessed via global routing resources. Other Virtex series FPGA chips differ from this example in the total number of CLBs, the number of slices in each CLB, the amount of routing resources, etc.

4.3 Proposed Placement Algorithm

We propose our force-directed performance-driven placement algorithm in this section. This section is organized as follows: In Section 4.3.1, we outline this work. In Section 4.3.2, we discuss the net cluster level floorplanning. In Section 4.3.3, we present the net level placement using the force-directed method. In Section 4.3.4, we show the details of cell placement. In Section 4.3.5, we propose a revised bipartite matching based approach to place the I/O pins. In Section 4.3.6, we summarize our algorithm.

4.3.1 Overview of the Algorithm

Our proposed FPGA placement algorithm has three levels, organized in a top-down manner. The input netlist to the placer is obtained after technology mapping by the Xilinx CAD tool. We first check the number of connected components according to the netlist. If it has multiple connected components, we process each component separately. Then, a net dependency graph is formed from the netlist.
The top level of our algorithm partitions the nets into clusters, and net-cluster level floorplanning is performed using simulated annealing to optimize the total netlength. The intermediate level of our algorithm computes a coarse net placement using a force-directed method. The bottom level of our approach achieves the detailed placement, also using a force-directed method, where the forces on nets determine the positions of the cells in this phase. Finally, the I/O pins are placed with a revised minimum weighted bipartite matching scheme. The final placement is fed back to the Xilinx tool to perform a reentrant routing. Figure 4.3 illustrates the design flow of our proposed algorithm.

Figure 4.3. Design Flow of the Proposed Placement Algorithm. Flowchart stages: technology-mapped netlist and target device information; convert the netlist to net dependency graph and partition nets into clusters; net-cluster level floorplanning using simulated annealing; force-directed coarse net placement; force-directed detailed cell placement; place I/O pins by bipartite matching; perform reentrant routing; detailed placement solution; placement and timing information.

4.3.2 Net Cluster Floorplanning

The input netlist to our placement algorithm is derived after technology mapping. We first check whether the design has multiple connected components, which can be done by running a depth-first search on the netlist. By doing so, we can process each component individually. This not only reduces the subproblem size, but also ensures that the logic cells in different components will not be placed in the same region on the FPGA. Hence, routing traffic can be decreased and the performance of the placement will be increased.

In the top level of our algorithm, a net dependency graph is constructed from the netlist and clique partitioning is carried out on this graph. A net dependency graph $G = (V, E)$ is a weighted undirected graph which is constructed from the netlist. Each node $v \in V$ denotes a net, and an edge $(u, v)$ exists if and only if the nets represented by $u$ and $v$ share at least one cell. The weight of each edge $(u, v)$ is determined by the number of cells shared by nets $n_u$ and $n_v$. A net clique, also called a net cluster, is a complete subgraph of the net dependency graph. A net cluster indicates a set of strongly connected nets which preferably should be routed in the same region to optimize timing and the total netlength. The clique partitioning problem is NP-complete in nature, so we use the clique partitioning heuristic proposed by Tseng and Siewiorek [97]. We modify this method in such a way that we try to find the maximum weighted cliques. We first build a priority queue to store the nets according to the number of their neighboring nets in descending order. Then, we partition the net dependency graph starting from the head of the queue. We check whether other nets can be added to the current clique according to their order in the priority queue. The clique partitioning procedure terminates when every net has been added to a net cluster. Figure 4.4(a) gives an example of an input netlist. Figure 4.4(b) shows the corresponding net dependency graph and the net clusters derived by using the heuristic. Each edge in the net dependency graph has unit weight except edge $(n_1, n_6)$ and edge $(n_4, n_6)$, which have a weight of 2.

The motivation for clustering nets rather than clustering cells is that we would like to group nets having high interconnections together such that they will be placed in nearby locations while performing net-level floorplanning. Consequently, logic cells within each cluster are to be placed close to each other in the final placement. As a result, these strongly connected nets can be routed as much as possible by using shorter routing segments.
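The graph construction and the queue-ordered clustering described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the Tseng and Siewiorek heuristic [97] itself nor the weighted variant used in this work: the function names, the dictionary-based netlist format, and the tie-breaking by net name are hypothetical, and the edge weights are computed but not used to steer the clustering.

```python
from itertools import combinations


def build_net_dependency_graph(netlist):
    """Build the weighted net dependency graph.

    netlist: dict mapping a net name to the set of cells on that net
             (hypothetical input format).
    Returns a dict mapping frozenset({u, v}) to the number of shared cells.
    """
    edges = {}
    for u, v in combinations(sorted(netlist), 2):
        shared = len(netlist[u] & netlist[v])
        if shared > 0:  # an edge exists iff the nets share at least one cell
            edges[frozenset((u, v))] = shared
    return edges


def greedy_clique_partition(netlist, edges):
    """Partition nets into cliques, seeding from the nets with most neighbors."""
    neighbors = {net: set() for net in netlist}
    for edge in edges:
        u, v = tuple(edge)
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Priority order: more neighboring nets first; ties broken by name (assumption).
    queue = sorted(netlist, key=lambda net: (-len(neighbors[net]), net))

    assigned = set()
    clusters = []
    for seed in queue:
        if seed in assigned:
            continue
        clique = [seed]
        assigned.add(seed)
        for candidate in queue:
            # A net joins only if it is adjacent to every current clique member.
            if candidate not in assigned and all(
                candidate in neighbors[member] for member in clique
            ):
                clique.append(candidate)
                assigned.add(candidate)
        clusters.append(clique)
    return clusters


# Small made-up netlist: n1 and n3 share two cells, so their edge has weight 2.
example = {
    "n1": {"a", "b", "x"},
    "n2": {"b", "c"},
    "n3": {"a", "c", "x"},
    "n4": {"d"},
    "n5": {"d", "e"},
}
graph = build_net_dependency_graph(example)
clusters = greedy_clique_partition(example, graph)
```

In this toy netlist, nets n1, n2, and n3 are mutually adjacent and form one cluster, while n4 and n5 form a second one.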
This also implies that fewer routing switches will be used, which helps to improve timing. Longer routing segments can be saved to route global critical paths, and the total wirelength will be reduced as well.

Once the net dependency graph is partitioned into net clusters, we place the net clusters in such a way that the overall interconnect is minimized. Because there are slices connected by nets which belong to different clusters, we would like to minimize the distance between these clusters as well.

Figure 4.4. Example of Net Clustering: (a) Netlist (b) Net Dependency Graph.

The area of a cluster $C$ is estimated as:

$$A(C) = \sum_{s_i \in C} A(s_i) \tag{4.1}$$

If slice $s_i$ belongs solely to $C$, then $A(s_i) = 1$. Otherwise, $A(s_i) = 1/n$, where $n$ is the number of clusters that have slice $s_i$ in common.

The net cluster level floorplan is obtained by a simulated annealing based approach using the sequence-pair representation [98]. The floorplan is represented by two sequences of the clusters, $P$ and $N$, which denote the relative positions of the clusters. Since the number of net clusters is smaller than the number of slices, the time needed to find a net level floorplan is much less than the time required to find a floorplan at the slice level, which is very desirable. The initial seed to the simulated annealing approach is a randomly generated placement which conforms to the dimension constraints of the target FPGA chip. We allow the following moves: exchange two clusters in the $P$ sequence, exchange two clusters in both the $P$ and $N$ sequences simultaneously, and change the aspect ratio of a cluster [98]. The placement problem for FPGAs differs from that of ASICs in that there exists a dimension constraint.
This is so because the numbers of rows and columns of CLBs are fixed for the target FPGA device. So a move can be accepted only if it does not violate this constraint. While changing the aspect ratio of a cluster, we change its width by a factor randomly generated within the range of -30% to +30%. Its height is modified accordingly in order to keep the area of this cluster unchanged.

The cost function of a floorplan $F$ is a weighted sum of the interconnections between clusters and the interconnections between clusters and I/O pins on the boundaries of the target device:

$$\mathit{Cost}(F) = \sum_{i \neq j} k_1\, D(C_i, C_j) + \sum_{i} k_2\, N_i\, I(C_i) \tag{4.2}$$

where $D(C_i, C_j)$ is the Manhattan distance between the geometrical centers of the bounding boxes of clusters $C_i$ and $C_j$, $I(C_i)$ is the shortest distance between the geometrical center of $C_i$ and the boundaries of the device, and $N_i$ is the number of slices in $C_i$ which have connections with I/O pins. The factors $k_1$ and $k_2$ in Equation 4.2 account for the overhead for routing. They are obtained empirically and can be adjusted according to the routing resource availability of the FPGA chip. In our approach, $k_1$ and $k_2$ are set to 1 and 1.3, respectively. We assign $k_2$ a larger value than $k_1$ because we would like to prioritize those cells connected to external I/O pins. This indicates that I/O connected cells will be placed close to the boundaries to improve total wirelength and timing. Note that we compute $D(C_i, C_j)$ only when the two clusters share at least one slice, and compute $I(C_i)$ only when at least one slice in $C_i$ has a connection with an I/O pin.

The initial temperature of the simulated annealing procedure is computed after generating 20 random floorplans as in [99]:

$$\mathit{InitTemp} = \sum_{i=1}^{20} \mathit{Cost}(F_i) \;\ldots\; \log \ldots \tag{4.3}$$

where the factor … is set to 0.95 in our approach.
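As an illustration of Equation 4.2, the cost evaluation might be sketched in Python as below. This is a sketch under stated assumptions rather than the implementation used in this work: the function and parameter names are hypothetical, each pair of clusters that shares a slice is counted once, and the device is assumed to span from (0, 0) to (width, height) so that the boundary distance is the distance to the nearest edge.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def floorplan_cost(centers, io_slices, shared_pairs, width, height,
                   k1=1.0, k2=1.3):
    """Weighted interconnect cost in the spirit of Equation 4.2.

    centers:      dict cluster name -> (x, y) center of its bounding box
    io_slices:    dict cluster name -> number of slices connected to I/O pins
    shared_pairs: iterable of (ci, cj) cluster pairs sharing at least one slice
    width/height: device dimensions (assumed origin at one corner)
    """
    cost = 0.0
    # First term: distance between clusters that share at least one slice.
    for ci, cj in shared_pairs:
        cost += k1 * manhattan(centers[ci], centers[cj])
    # Second term: only clusters with I/O-connected slices contribute.
    for name, n_io in io_slices.items():
        if n_io > 0:
            x, y = centers[name]
            boundary_distance = min(x, width - x, y, height - y)
            cost += k2 * n_io * boundary_distance
    return cost


# Made-up example: two clusters on a 10 x 10 device.
centers = {"A": (2.0, 3.0), "B": (5.0, 7.0)}
cost = floorplan_cost(centers, {"A": 2, "B": 0}, [("A", "B")],
                      width=10.0, height=10.0)
```

Here the distance term contributes 7 (3 + 4) and the I/O term contributes 1.3 x 2 x 2 = 5.2, since cluster A is 2 units from the nearest edge.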
The temperature cooling schedule used is as follows:

$$T_{new} = \begin{cases} 0.95\, T_{old} & \text{if } T_{old} \in [0.3, 0.8] \cdot \mathit{InitTemp} \\ 0.80\, T_{old} & \text{otherwise} \end{cases} \tag{4.4}$$

The temperature is lowered at a faster rate while it is very high (> 0.8 InitTemp) or very low (< 0.3 InitTemp), and it is lowered at a slower rate otherwise. Since simulated annealing is generally considered to be very time consuming, we use this faster cooling schedule to reduce the run time without penalizing the quality of the result. The simulated annealing process is terminated when the temperature value drops below 1.

4.3.3 Coarse Net-level Placement

The intermediate level of our algorithm performs a coarse net level placement using a force-directed scheme. Below, we first discuss the characteristics of the various forces we use in our approach. Then, we describe how we use this method to place nets.

4.3.3.1 Attractive and Repulsive Forces

In our force-directed placement approach, we use two types of forces. The main force is the attractive force, which obeys Hooke's law as in Equation 4.5, where $k$ is a constant known as the spring constant and $x$ is the spring deformation.

$$F = -k\,x \tag{4.5}$$

The attractive force pulls nets connected in the net dependency graph to closer positions. Using attractive force in placement was first proposed in [100]. Since net clusters have different shapes and sizes, we also apply a repulsive force to avoid net cluster overlaps. We have to find the equilibrium positions for the objects (nets in intermediate level placement and logic cells in low level placement). This gives a potential placement that results in short net lengths. Since the repulsive force is electrostatic in nature, it has a circular symmetry characteristic.
We model each net as a circle with radius $R$ estimated as follows:

$$R = 0.4\,(h + w) / (\log n + 1) \tag{4.6}$$

where $h$ and $w$ are the height and width, respectively, of the net cluster containing this net, and $n$ is the number of clusters this net is connected with.

In general, the attractive force on object $i$ due to object $j$ is computed as follows:

$$F_{a(i,j)} = -K_{a(i,j)}\,(p_i - p_j) \tag{4.7}$$

where $K_{a(i,j)}$ is an analogous attractive coefficient, and $p_i$ and $p_j$ are the position vectors of objects $i$ and $j$, respectively. In our case, the value of the force factor $K_{a(i,j)}$ is set to 1 initially. It is increased after each iteration using Equation 4.8 in order to attract connected objects even closer.

$$K_{a(i,j)} = K_{a(i,j)}^{(old)} + (1 - \alpha)\,(|p_i - p_j| / D) \tag{4.8}$$

where $\alpha$ is a user defined constant between 0 and 1 (0.5 in our approach), and $D$ is the maximum distance between objects $i$ and $j$.

In order to prevent two objects from overlapping, we introduce a repulsive force. The repulsive force has an electrostatic-like characteristic: it is strong at close distances and weak at long distances. So, whenever two objects are getting too close, the repulsive force becomes very strong and pushes them apart. The repulsive force on object $i$ due to object $j$ is determined by the following equation:

$$F_{r(i,j)} = K_r\,(r/d)^2\,\frac{p_i - p_j}{|p_i - p_j|} \tag{4.9}$$

where $K_r$ is the repulsion constant, which is set to 1 in our case, $r$ denotes the sum $R_i + R_j$, where $R_i$ and $R_j$ are the radii of objects $i$ and $j$, respectively, and $d$ is the distance between the two objects.

In order to find the equilibrium positions for a set of objects, we need to find a position for each object such that the total force on object $i$, $F_{total}(i)$, is zero.
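The two pairwise forces of Equations 4.7 and 4.9 can be illustrated with a short Python sketch. It assumes two-dimensional position tuples and that the attractive force on object i points toward object j while the repulsive force points away from it; the function names and the handling of coincident objects are hypothetical choices made for this example.

```python
import math


def attractive_force(p_i, p_j, k_a=1.0):
    """Spring-like pull on object i toward object j (cf. Equation 4.7)."""
    return (-k_a * (p_i[0] - p_j[0]), -k_a * (p_i[1] - p_j[1]))


def repulsive_force(p_i, p_j, r_i, r_j, k_r=1.0):
    """Electrostatic-like push on object i away from object j (cf. Equation 4.9)."""
    dx = p_i[0] - p_j[0]
    dy = p_i[1] - p_j[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        # The direction is undefined for coincident objects (assumed error case).
        raise ValueError("objects coincide; repulsion direction is undefined")
    magnitude = k_r * ((r_i + r_j) / d) ** 2
    # Unit vector from j to i, scaled by the magnitude.
    return (magnitude * dx / d, magnitude * dy / d)


# Object i at the origin, object j at (3, 4), so the distance is 5.
fa = attractive_force((0.0, 0.0), (3.0, 4.0))
fr = repulsive_force((0.0, 0.0), (3.0, 4.0), r_i=2.5, r_j=2.5)
```

With radii summing to the distance between the objects, the repulsive magnitude is exactly the repulsion constant, and it grows quadratically as the objects move closer.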
$F_{total}(i)$ is computed as shown in Equation 4.10:

$$F_{total}(i) = \sum_{j=1,\; j \neq i}^{n} \left[ F_{a(i,j)} + F_{r(i,j)} \right] \tag{4.10}$$

This problem can be solved using the Newton-Raphson method [101]. The equilibrium position for object $i$, $p_{(i,e)}$, can be found as follows:

$$p_{(i,e)} = p_i - k_e\,\frac{F_{total}(i)}{\partial F_{total}(i) / \partial p} \tag{4.11}$$

where $k_e$ is a constant which in our approach is set to 0.5, and $p_i$ is the original position of object $i$. We consider $p_{(i,e)}$ an equilibrium position when $|p_{(i,e)} - p_i| < \epsilon$, where $\epsilon$ is the maximum user chosen admissible tolerance. We perform this computation on all objects one by one until the equilibrium positions of all objects are found.

4.3.3.2 Net Placement

After we have computed a net cluster floorplan, we use the force-directed method to obtain a coarse net placement. In this process, an object refers to a net. In order to characterize the properties of nets, we use the interconnect model presented by Mo et al. [102], where each net is modeled as a star instead of a complete graph. Figure 4.5(a) shows an example of a 5-pin net represented by the star model. For a k-pin net represented by this model, each net has a star node to which all $k$ terminals of the net are connected. Hence, $k$ wires and a star node suffice. On the other hand, the complete graph representation needs $k(k-1)/2$ wires. Figure 4.5(b) gives an example using this model for a 5-pin net.

Figure 4.5. (a) Star Model of a 5-pin Net. (b) Complete Graph Model of a 5-pin Net.

For nets with a large number of pins, the complete graph model leads to redundant wires and overestimates the wirelength. The star model is appropriate for our force-directed method since the star node and the terminals attract each other such that the wirelength will be optimized. In addition, it contains information about how nets may be routed by the router.
The complete graph model, in contrast, does not possess any information about possible routing. In this work, we only consider single star node nets. This model can be easily expanded to deal with multi-star nets. For example, we can model a two-star net as an ellipse whose foci are the two stars.

The force on a net $n_i$ assigned to net cluster $C_l$ is given by Equation 4.12.

$$F(n_i) = \sum_{(n_i, n_j) \in E} F_a(n_i, n_j) + \sum_{i \neq j,\; n_j \in C_l} F_r(n_i, n_j) + k_l\, F_a(n_i, C_l) \tag{4.12}$$

The first term accounts for the attractive force between net $n_i$ and all nets $n_j$ which are adjacent to $n_i$ in the net dependency graph $G$. The second term describes the repulsive force between nets in the same net cluster, which is useful to avoid all nets being placed in the same location. The last term is the force that keeps net $n_i$ in the region allocated for the corresponding net cluster $C_l$. The factor $k_l$ is the number of nets that are connected with $n_i$ but do not belong to $C_l$.

Initially, all nets in the same cluster are placed randomly in the designated area for this cluster. Then, the equilibrium positions of the nets are found using Equation 4.11. Note that the position of a net here actually refers to the position of the star node of the net. This procedure works in a net by net fashion instead of processing all nets simultaneously.

4.3.4 Logic Cell Placement

In this phase, the detailed logic cell placement is generated, again using the force-directed method. Each cell is also modeled as a circle, and its radius is computed using the following formula:

$$R = (h + w) / 4 \tag{4.13}$$

where $h$ and $w$ are the height and width of the cell, respectively. At first, logic cells are placed randomly within the region determined in the net cluster placement phase. If a logic cell is in more than one net cluster, it is placed inside the region of the first net cluster containing it.
So each logic cell is placed only once. An iterative force-directed scheme is carried out on the cells until the equilibrium positions are found. In order to save computing time, we terminate this process once the number of iterations exceeds 200. The force on a slice $s_i$ is computed as:

$$F(s_i) = \sum_{n_j \in N_i} F_a(s_i, n_j) + r \sum_{I/O_j \in P_i} F_a(s_i, I/O_j) + \beta \sum_{s_j \in S_i} F_r(s_i, s_j) \tag{4.14}$$

The first term is an attraction force between slice $s_i$ and the nets to which $s_i$ is connected. $N_i$ denotes the set of nets connected with this slice. The second term is also an attraction force. If slice $s_i$ is connected with I/O blocks, this force attracts $s_i$ to the chip boundary. $P_i$ denotes the set of I/O pins that are connected with $s_i$, and $r$ is a factor bigger than 1 (1.5 in our approach) to indicate the priority of interconnections with I/O pins. The last term is a repulsive force between $s_i$ and all slices that are adjacent to it, where $S_i$ denotes the set of slices connected with $s_i$. The factor $\beta$ is set to 0.5 in our algorithm.

Unlike the placement problem in ASICs, it is unnecessary to specifically consider avoiding overlaps for FPGAs, because all logic will be placed into CLBs, which are in discrete locations. Once the positions of all the slices are stable, we place them on the FPGA chip. Since the locations of CLBs are represented by integers, we convert all slices' coordinates into integers at this time. For example, if the coordinate of slice $s$ is (4.3, 5.8), we convert it to the new coordinate (5, 6) using the ceiling function, because the locations of CLBs on the FPGA start from (1, 1). This may cause a problem when the number of slices to be placed into the same CLB is greater than the number of slices available in a single CLB. This is the motivation for the third term in Equation 4.14.
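The coordinate conversion and the resulting capacity problem can be illustrated with a small Python sketch. The function names and the dictionary input format are hypothetical, and the capacity of two slices per CLB is taken from the Virtex description in Section 4.2; the sketch only detects overfull CLBs and does not resolve them.

```python
import math
from collections import defaultdict


def snap_to_clb(position):
    """Map a continuous (x, y) position to an integer CLB location via ceiling."""
    return (math.ceil(position[0]), math.ceil(position[1]))


def find_overfull_clbs(slice_positions, slices_per_clb=2):
    """Return the CLB locations that receive more slices than they can hold.

    slice_positions: dict slice name -> continuous (x, y) equilibrium position
    """
    occupancy = defaultdict(list)
    for name, position in slice_positions.items():
        occupancy[snap_to_clb(position)].append(name)
    return {clb: names for clb, names in occupancy.items()
            if len(names) > slices_per_clb}


# Made-up equilibrium positions: three slices land in CLB (5, 6).
positions = {
    "s1": (4.3, 5.8),
    "s2": (4.9, 5.1),
    "s3": (4.1, 5.5),
    "s4": (1.0, 1.0),
}
overfull = find_overfull_clbs(positions)
```

In this example the CLB at (5, 6) would need to hold three slices, which is the kind of conflict the repulsive term is meant to make rare.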
By enforcing repulsive forces on slices, their positions will not be too close to each other, and this strategy effectively reduces the possibility that too many slices have to be placed into the same CLB. In case a conflict still exists, we place those extra slices into nearby available CLBs. Generally, when we determine how to assign slices to available CLBs, we use the following criteria. First, the slices are placed according to the order obtained while reading the input netlist. Second, because routing within the same CLB results in less delay, we try to pair up slices with the largest number of connections between them and put them into the same CLB. They will be placed into slice(0) and slice(1), respectively. This will not only improve the timing between this slice pair but also reduce the demand for routing resources at the CLB level. Finally, we use the following scheme to fine-tune the placement. After all slices have been placed, we try to move slices around as long as the total net length can be reduced. According to our experiments, in most cases this gives slightly better results at negligible extra running time. Besides, if the number of I/O pins of the circuit is larger than 75% of the I/O blocks available on the chip, we place the slices in the middle region of the chip. Otherwise, we place the slices in a region close to one corner of the chip. The results generated using this simple scheme have better performance than the results without it.

4.3.5 I/O Pin Matching

Once the slices have been placed, we need to place the I/O pins onto the I/O blocks on the FPGA chip. Essentially, this is a weighted bipartite matching problem which can be solved optimally by Munkres' assignment algorithm [103].
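To make the matching objective concrete, the sketch below finds a minimum-weight assignment of pins to I/O blocks by exhaustive search. This is deliberately not Munkres' algorithm: it runs in factorial time and only illustrates what an optimal assignment means on a tiny instance. The cost matrix is hypothetical (it could hold, for example, Manhattan distances between each pin's slice and each I/O block).

```python
from itertools import permutations


def min_weight_matching(cost):
    """Exhaustive minimum-weight assignment.

    cost[p][b] is the cost of assigning pin p to block b, and there must be
    at least as many blocks as pins. Returns (total_cost, block_of_pin).
    """
    n_pins, n_blocks = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    for blocks in permutations(range(n_blocks), n_pins):
        total = sum(cost[p][b] for p, b in enumerate(blocks))
        if total < best_total:
            best_total, best_assign = total, list(blocks)
    return best_total, best_assign


# Hypothetical 3-pin, 3-block instance.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assign = min_weight_matching(cost)
print(total, assign)
```

A greedy choice would give pin 1 its zero-cost block first, whereas the optimal assignment gives that block to pin 0, which is why an exact matching algorithm is needed.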
A revised version of this algorithm was used in our previous work [104] in order to improve the critical path delay. In this paper, we modify the I/O matching algorithm we used previously to further reduce the running time. The computational complexity of Munkres' algorithm is $O(n^3)$, where $n$ is the number of I/O blocks available on the chip. The running time would be extremely high for I/O-intensive circuits. We first perform a topological sort starting from the input slices (slices connected to input pins) of the circuit to compute the longest path delay for each output slice (slice connected to an output pin). We assume that for each slice connected with an input pin, we can place the corresponding input pin at the nearest position on the boundary of the device. The delay is estimated using the Manhattan distance between slices because the routing information is not yet available at this time. By doing so, we can find the critical path for every output slice $s_i$ and we also derive the corresponding input slice on this path. We call this input slice the critical input slice of output slice $s_i$. Then, starting from the output slice $s_i$ with the largest delay, we find the nearest I/O block available on the FPGA chip to place the output pin for slice $s_i$. Then we place the input pin for the corresponding critical input slice in the same way if this input pin has not been placed yet. In case an input slice $s_i$ is on the critical path of multiple output slices, the input pin position of $s_i$ is decided while processing the first output slice which has $s_i$ as its critical input slice. Once all pins connected to output slices and critical input slices have been placed, we compute the minimum weighted bipartite matching for the I/O block matching of the remaining input pins to minimize the total interconnections.
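The critical-path-first part of this procedure can be sketched as below. It is an illustration under stated assumptions rather than the thesis code: the slice graph is assumed acyclic, every source node is assumed to be an input slice, enough free I/O blocks are assumed to exist, and all names, positions, and I/O block locations are hypothetical. The final matching of the remaining input pins is left out.

```python
from collections import deque


def manhattan(a, b):
    """Manhattan distance, used as the delay estimate before routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def longest_paths(pos, edges):
    """Topological sweep returning (delay, critical_input) for every slice."""
    succ = {s: [] for s in pos}
    indeg = {s: 0 for s in pos}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    delay = {s: 0 for s in pos}
    crit_in = {s: (s if indeg[s] == 0 else None) for s in pos}
    ready = deque(s for s in pos if indeg[s] == 0)
    while ready:
        u = ready.popleft()
        for v in succ[u]:
            d = delay[u] + manhattan(pos[u], pos[v])
            if crit_in[v] is None or d > delay[v]:
                delay[v], crit_in[v] = d, crit_in[u]
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return delay, crit_in


def place_critical_pins(pos, edges, output_slices, io_blocks):
    """Place output pins (largest delay first) and their critical input pins."""
    delay, crit_in = longest_paths(pos, edges)
    free = list(io_blocks)
    pin_at = {}

    def take_nearest(kind, s):
        block = min(free, key=lambda b: manhattan(pos[s], b))
        free.remove(block)
        pin_at[(kind, s)] = block

    for out in sorted(output_slices, key=lambda s: delay[s], reverse=True):
        take_nearest("out", out)
        if ("in", crit_in[out]) not in pin_at:
            take_nearest("in", crit_in[out])
    return delay, crit_in, pin_at, free


# Hypothetical placed slices: a, b are input slices; d, e are output slices.
pos = {"a": (1, 1), "b": (1, 4), "c": (3, 2), "d": (5, 5), "e": (5, 1)}
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")]
io_blocks = [(0, 1), (0, 4), (6, 5), (6, 1), (3, 0)]
delay, crit_in, pin_at, free = place_critical_pins(pos, edges, ["d", "e"], io_blocks)
print(delay["d"], crit_in["d"], pin_at)
```

In this instance both outputs share the critical input slice b, so its pin is fixed while processing d (the output with the largest delay), and the pin of a is left for the subsequent matching over the remaining free blocks.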
The advantages of this method include: (1) By placing the pins connected to output slices and critical input slices first, we can improve the pin-to-pin delay of the circuit because basically we want to prioritize the critical paths for routing. (2) Since some I/O blocks have been placed, the problem size is decreased and thus the running time is reduced as well. In our experiments, this scheme not only improves the post-layout delay but also reduces the computation time dramatically.

4.3.6 Summary of the Proposed Algorithm

We have introduced our force-directed performance-driven algorithm for hierarchical FPGAs. The overall design flow of our algorithm is shown in Figure 4.6. Lines 0-2 account for the top level of our proposed approach, where a net-dependency graph is constructed on top of the input netlist derived after technology mapping and net-level floorplanning is performed using simulated annealing. Lines 3-6 are the intermediate level of our algorithm. In this stage, we find a coarse net-level placement with a force-directed method. Lines 7-11 describe the phase that computes the cell-level placement. Lines 12-19 place the I/O pins using a minimum bipartite matching scheme. Since we have used a number of constants in our algorithm, we summarize these constants in Table 4.1.

Table 4.1. List of Constants Used in Our Work.

Name   In Eqn.  Meaning                                                           Value
k_1    (4.2)    routing overhead for CLBs not connected to I/O pins               1
k_2    (4.2)    routing overhead for CLBs connected to I/O pins                   1.3
       (4.3)    factor to reduce temperature during SA                            0.95
       (4.8)    factor to change the spring constant K_1                          0.5
K_r    (4.9)    spring constant for computing repulsive force                     1
k_e    (4.11)   compute the equilibrium position                                  0.5
r      (4.14)   weight on interconnections with I/O pins during cell placement    1.5
       (4.14)   weight on interconnections between slices during cell placement   0.5

Force-directed Placement Algorithm
Input: Netlist obtained after technology mapping
Output: A performance-driven placement solution
    /* Net-cluster level floorplanning */
 0  Construct net dependency graph G;
 1  Perform clique partitioning on G;
 2  Net-cluster-based floorplanning by simulated annealing;
    /* Coarse net placement */
 3  while nets are not stable do
 4      Calculate forces on nets;
 5      Find new positions for nets;
 6  endwhile
    /* Slices placement */
 7  while slices are not stable do
 8      Calculate forces on slices;
 9      Find new positions for slices;
10  endwhile
11  Perform slices movement;
    /* I/O pins matching */
12  Perform a topological sort to compute the longest path delays of the output slices;
13  Q ← a queue of the output slices according to their longest path delays in descending order;
14  while Q ≠ ∅ do
15      slice(i) ← dequeue(Q);
16      Place the output pin of slice(i) to the closest I/O block available;
17      Place the input pin of the corresponding critical input slice in the same way if it has not been placed;
18  endwhile
19  Place the rest of input pins by min-weight bipartite matching;
Endalgorithm

Figure 4.6. Force-Directed Performance-Driven Placement Algorithm.

4.4 Experimental Results

We have implemented the force-directed placement algorithm in C++ and tested it on a set of MCNC benchmarks on a Pentium 1.5 GHz Linux system with 256MB RAM.
We compare our force-directed placement algorithm with Xilinx Foundation 4.1i, a commercial FPGA CAD tool. We also modify the proposed algorithm in order to compare with VPR [87], a well-known place and route tool developed by the University of Toronto.

4.4.1 Comparison with Xilinx Foundation Tool

The input netlist is in VHDL format, which can be derived using the blif2vhdl script. First, we run Xilinx Foundation alone to map, place, and route each design. Second, we use the Xilinx tool for the mapping, use our force-directed algorithm to place the mapped circuit, and perform reentrant routing using PAR of the Xilinx Foundation tool. Our target device is Xilinx Virtex v800fg680 and we use the default settings when running the Foundation tool. We used "par -k -rl 2" to perform the reentrant routing on the placement generated by our algorithm, and the routing level "-rl 2" was the same as when we ran the entire process solely using the Foundation tools. The detailed experimental flow is shown in Figure 4.7.

Figure 4.7. Experimental Flow of Our Algorithm. (Flowchart boxes: Input netlist; Map the circuit with Xilinx Foundation tool; Place circuit using Xilinx tool; xdl ncd2xdl: convert .ncd format to .xdl format; .xdl netlist of mapped circuit; Place circuit with force-directed method; Convert the placed circuit to .ncd format; .ncd netlist of placed circuit; Perform reentrant detailed routing with Xilinx; Timing information.)

We first conducted our experiments on a set of combinational circuits. Table 4.2 shows the characteristics of these circuits, where the numbers of slices and nets are obtained after technology mapping.

Table 4.2. Characteristics of Combinational Circuits.

Ckt     # Slices  # Nets
c2670   315       332
c3540   266       425
c5315   541       643
c6288   368       628
c7552   580       698
dalu    265       411
des     1159      1530
i10     885       1044
i8      414       529
k2      304       454
pair    554       644

The experimental results for these combinational circuits are shown in Table 4.3, where D1 refers to the maximum pin-to-pin delay from a PI to a PO, D2 is the average connection delay between slices, and T is the CPU time to place each design. Note that the CPU time for Xilinx is derived directly from the place and route report generated by the Xilinx Foundation tool. On average, we improve the maximum pin-to-pin delay and the average connection delay by 10.2% and 19.3%, respectively. We can see that our algorithm used more CPU time than Xilinx. However, the longest run time (for "des") is around three minutes (188.2 s versus 63 s for Xilinx), which we do not consider significantly inferior.

Table 4.3. Comparison with Xilinx Foundation for Combinational Circuits.

        Xilinx                   Ours                     % Improvement
Ckt     D1(ns)  D2(ns)  T(s)    D1(ns)  D2(ns)  T(s)     D1     D2
c2670   25.50   2.80    4       23.35   2.31    8.4      8.4    17.9
c3540   35.46   2.73    5       33.74   1.73    16.5     4.9    36.6
c5315   31.76   3.00    7       30.07   3.01    32.3     5.3    -0.3
c6288   58.17   2.90    6       56.67   1.96    36.9     2.6    32.4
c7552   37.26   3.20    8       30.07   2.88    40.9     19.3   10.0
dalu    33.08   2.27    3       26.91   1.60    13.9     18.7   29.5
des     37.69   3.05    63      37.07   3.25    188.2    1.6    -6.5
i10     44.68   3.92    35      43.23   3.26    89.6     3.2    16.8
i8      31.89   3.41    10      23.04   2.31    28.7     27.8   32.3
k2      29.21   2.98    3       25.02   1.89    19.1     14.3   36.6
pair    27.73   3.25    6       25.96   3.04    35.8     6.4    6.5
Avg.                                                     10.2   19.3

We also ran experiments on a set of sequential circuits; the characteristics of these circuits are shown in Table 4.4. The experimental results for sequential circuits are shown in Table 4.5. The maximum improvement achieved is 75.7% for "bigkey" and the average improvement in the maximum clock speed is 20.7%.

Table 4.4. Characteristics of Sequential Circuits.

Ckt     # Slices  # Nets
bigkey  1066      1375
clma    2502      4653
dsip    907       1149
planet  145       237
s13207  516       739
s38584  1954      3258
sbc     250       340
scf     254       360
styr    124       209
tbk     147       275

Table 4.5. Comparison with Xilinx Foundation for Sequential Circuits.

        Xilinx                  Ours                    % Improvement
Ckt     f_max (MHz)  Time (s)   f_max (MHz)  Time (s)   Clock Speed
bigkey  26.04        25         45.77        235        75.7
clma    19.52        46         22.68        422        16.2
dsip    47.90        16         50.41        148        5.2
planet  69.34        3          74.06        6.6        6.8
s13207  88.11        6          98.39        86.9       11.7
s38584  57.04        28         69.03        534        21.0
sbc     69.22        4          81.37        12.4       17.6
scf     56.67        3          59.73        14.0       5.4
styr    62.37        2          68.76        4.6        10.2
tbk     47.32        3          64.69        8.1        36.7
Avg.                                                    20.7

In general, our force-directed placement approach outperforms Xilinx's CAD tool. It not only results in better timing, but also reduces the wirelength dramatically, which is also a good indication of better routability. In addition, as we do not have information about how the PAR tool of Xilinx Foundation performs routing, sometimes we could not improve the longest path delay to its full potential. For example, according to our observation, the router sometimes uses unnecessary detours, which increase the maximum delay. Thus, we feel that our algorithm could further improve its performance if we had better access to PAR, so that we could guide the router to route the critical paths with higher priorities.

4.4.2 Comparison with VPR

We have also implemented the algorithm in such a manner that we can compare it with VPR. VPR does not target hierarchical FPGAs even though it has the functionality to cluster LUTs into the same CLB. Once the cell-level placement is finished, we do not need to pair up CLBs as we can only place a single CLB in each location on the FPGA board.
The original benchmarks are in ".blif" format. Each circuit is optimized in SIS using "script.rugged" and then technology-mapped to 4-input lookup tables (LUTs) with FlowMap. Finally, the VPACK program is used to pack the circuit into CLBs, and the final netlist is in ".net" format, which can be read by VPR. We first run VPR to place and route each design using the default options. We set the size of the target FPGA to 80 x 80 CLBs and we use the file "4lut_sanitized.arch" included in the VPR package as the architecture file. Note that the default algorithm used for VPR's placer is timing-driven. Next, we run our algorithm to generate the placement and then feed it into VPR for reentrant routing. The experimental flow to compare with VPR is similar to the one shown in Figure 4.7, except that we do not need to convert between different file formats and VPR handles the reentrant routing instead of Xilinx. The experimental results are shown in Table 4.6, where D1 denotes the total net delay and D2 denotes the critical path delay. On average, compared with VPR, our algorithm reduces the total net delay and the critical path delay by 11.5% and 10.7%, respectively. As for the run time, we can see that our algorithm is slightly faster than the simulated-annealing-based VPR placer.

Table 4.6. Comparison with VPR.

        VPR                      Ours                     % Improvement
Ckt     D1(ns)  D2(ns)  T(s)    D1(ns)  D2(ns)  T(s)     D1     D2
c2670   94.4    99.0    27.9    83.6    93.1    29.2     11.5   6.0
c3540   84.3    94.9    25.5    77.2    87.1    23.5     8.4    8.2
c5315   93.3    100     47.2    84.8    92.1    41.9     9.1    7.9
c6288   154     179     31.9    133.7   169.2   27.2     6.7    5.5
c7552   147     163     42.8    126     138     48.9     14.3   15.3
dalu    92.8    96.8    21.4    70.7    74.7    23.8     23.8   22.3
des     124     127     225.3   108.5   113.8   201.3    12.5   10.4
i10     167     185     118.6   142     161     111.5    14.9   13.0
i8      84.2    89.4    43.8    77.1    79.5    34.1     8.4    11.1
k2      66.8    74.1    24.2    61.3    67.5    16.5     8.2    8.9
pair    83.1    93.2    43.7    75.6    84.6    39.1     9.0    9.2
Avg.                                                     11.5   10.7

4.5 Conclusions and Future Work

In this chapter, we presented a net-clustering based performance-driven placement scheme for hierarchical FPGAs. We developed a top-down design flow to generate the placement for FPGAs in several levels. The main contribution of our work is the force-directed placement scheme, which is usually used in ASIC-based designs. Our algorithm improves the post-place-and-route critical path delay and average net length significantly over a commercial FPGA CAD tool from Xilinx. It also outperforms VPR, a well-known place and route tool from the University of Toronto. The improvement is achieved due to the following aspects:

• Net-level clustering and floorplanning give a very good entry point for the subsequent force-directed net placement and also save computing time over a purely simulated-annealing-based cell-level placement.
• The introduction of attraction and repulsive forces helps to reduce the interconnection length effectively.
• The cell-level force-directed approach is very efficient at optimizing net length.
• The longest-path-delay-oriented I/O matching scheme works very well in the sense of finding the best I/O block positions for input/output connected CLBs on critical paths. It also reduces the computing time by decreasing the problem size.

In the future, we would like to extend this work to improve routability. When we determine the forces on nets and slices, we will keep track of the wire density in order to avoid high traffic. We can modify the formulation for calculating forces to accommodate this objective. Besides, the I/O matching scheme will be adjusted accordingly to distribute the I/O pads evenly in order to facilitate the routing.
CHAPTER 5
HIGH LEVEL SYNTHESIS FOR PERFORMANCE DRIVEN PLACEMENT

High level synthesis (HLS) is the process of transforming a digital system from an abstract behavioral description to a structural specification. It generates a register-transfer level (RTL) structure to implement the behavior. Behavior is the way the system or its components interact with their environment (the mapping from inputs to outputs). Structure refers to the set of interconnected components which make up the system. Meanwhile, a set of optimization objectives are to be achieved. Usually these objectives in HLS include performance, area, cost, power, reliability, and testability. A large number of scheduling algorithms have been proposed by previous researchers [15] [16] [18] [19]. Recently, with the increase of research interest in low power design, several approaches have been proposed for high level synthesis [105] [106] [107] [108]. As we can see, there are numerous algorithms proposed for scheduling and binding in high level synthesis. However, very few algorithms have been proposed to work in such a way that the high level synthesis tool is aware of the layout information and hence is able to generate a new design accordingly to enhance the performance of the succeeding physical design. Xu and Kurdahi proposed a layout-driven RTL binding technique for high level synthesis using accurate estimators [109]. Later on, they proposed another layout-driven high level synthesis approach for FPGA architectures [110]. Their work was based on simple FPGA architectures, i.e., the Xilinx XC3000/XC4000 series, and they have not published any follow-up research work in this area. Kim et al. proposed an architectural high level synthesis approach which incorporates a performance-driven step to guide the post-layout scheduling [111].
The architecture they targeted is the distributed-register architecture, which explicitly separates the long interconnect delays from logic delays. But this architecture is irregular, which makes it difficult to estimate interconnect delay precisely. We propose a performance driven high level synthesis design flow for FPGAs. Our high level synthesis tool is able to iteratively enhance the system performance with the guidance information obtained from the physical design phase. The AUDI (AUtomatic Design Instantiation) [112] system developed by the University of South Florida is a high level synthesis tool which is capable of automatically synthesizing datapaths. Given a behavioral description of a design, it generates structural VHDL code, and this VHDL can be given to commercial CAD tools to perform logic synthesis followed by physical synthesis. Currently, the complexity of most placement and routing algorithms is nonlinear. With the number of logic gates on a single FPGA chip exceeding one million, the placement and routing time could be enormously long. Hence, it is very important that in the process of finalizing the placement, we can predict and optimize the circuit performance before the routing is executed. In this chapter, we study the performance driven placement problem with high level synthesis. The motivation is that our high level synthesis can produce different VHDL codes for the same design in a short time. So we can evaluate various options and select the one that yields the best performance. This chapter is organized as follows. In Section 5.1, we briefly introduce our high level synthesis system, AUDI. In Section 5.2, we propose our performance driven placement design flow with high level synthesis. In Section 5.3, we provide the experimental results.
In Section 5.4, we conclude this chapter and discuss future research directions.

5.1 Automatic Design Instantiation System (AUDI)

The AUtomatic Design Instantiation system [112] is a high level synthesis system capable of automatically generating RT-level designs from behavioral descriptions. It is developed at the University of South Florida by Dr. Katkoori's research group. Currently, VLSI chip designers are fabricating CMOS transistors at very small feature sizes (90 nm in production, as of today). This has given rise to new challenges on the design automation front. The key paradigm shift in the deep-submicron (DSM) regime is the dominance of interconnect phenomena such as wire delay, crosstalk, etc. AUDI is an interconnect-centric behavioral synthesis system, which is able to synthesize fully functional structural VHDL from a behavioral data flow graph (DFG) representation. At present, this system is capable of synthesizing datapath-intensive designs. Various scheduling algorithms have been implemented in the system. They vary from simple algorithms such as as-soon-as-possible (ASAP) and as-late-as-possible (ALAP), to complex force-directed algorithms such as force directed scheduling (FDS) proposed by Paulin and Knight [17] and the simultaneous scheduling, allocation, and mapping algorithm (SAM) proposed by Cloutier [113]. In addition, algorithms proposed by Gopalakrishnan and Katkoori [107] [108] have been integrated in AUDI to reduce leakage power. Allocation and mapping are implemented using a clique partitioning heuristic proposed by Tseng and Siewiorek [97]. This approach tries to form a minimal set of maximal-sized cliques, which results in maximum sharing between the components allocated. In the case of functional unit (FU) mapping, the input to the clique partitioning heuristic is the compatibility graph of the operations in the scheduled DFG.
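As an illustration of FU mapping by clique partitioning, the sketch below groups operations into cliques of a compatibility graph, where each clique becomes one shared functional unit. Two points are assumptions of this example and not taken from the text: the compatibility rule (same operation type, different control step) and the simple first-fit grouping, which is a stand-in for, not a reproduction of, the Tseng and Siewiorek heuristic. The operation list is hypothetical.

```python
def compatible(a, b):
    """Assumed rule: two operations can share an FU if they have the same
    type and are scheduled in different control steps."""
    return a["type"] == b["type"] and a["step"] != b["step"]


def first_fit_clique_partition(ops):
    """Put each operation into the first clique whose members are all
    compatible with it; open a new clique (a new FU) otherwise."""
    cliques = []
    for name, op in ops.items():
        for clique in cliques:
            if all(compatible(op, ops[m]) for m in clique):
                clique.append(name)
                break
        else:
            cliques.append([name])
    return cliques


# Hypothetical scheduled DFG: operation type and control step per operation.
ops = {
    "op1": {"type": "MULT", "step": 1},
    "op2": {"type": "MULT", "step": 1},
    "op3": {"type": "ADD", "step": 2},
    "op4": {"type": "MULT", "step": 2},
    "op5": {"type": "ADD", "step": 3},
}
cliques = first_fit_clique_partition(ops)
print(cliques)
```

Each resulting clique corresponds to one allocated unit, so this example needs two multipliers and one adder; fewer, larger cliques mean more sharing and more multiplexing on the unit inputs.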
In the case of register mapping, a compatibility graph of the edges in the DFG is formed. Two edges are compatible if they have non-overlapping lifetimes. Sharing of FUs and registers is implemented using multiplexers or buses. The datapath can be synthesized using either components from an available standard cell library or the multi-threshold CMOS (MTCMOS) component library which was developed at the University of Cincinnati [114]. The library consists of functional units (i.e., adders, subtractors, and multipliers), storage units (i.e., registers), and interconnect units (i.e., buses and multiplexers). Figure 5.1 illustrates the RT-level model synthesized by AUDI. The top level of the design instantiates a datapath and a controller. The datapath and the controller communicate with each other through flags and control signals. They are driven by the same clock signal. Essentially, the controller is a finite state Moore machine. The controller generates control signals which enable the registers for writing and the select signals for the interconnect units (multiplexers). Designs synthesized by the AUDI system have been validated at the RT level using the Cadence VHDL simulator. Layouts are generated using the LAGER IV Silicon Compiler [115] and verified using HSPICE.

Figure 5.1. RT-Level Design Model of AUDI System. (Diagram labels: CONTROLLER, DATAPATH, CLOCK, RESET, START, FINISH, ControlSignals, Flags, PrimaryInputs, PrimaryOutputs.)

The AUDI system takes an AIF (AUDI intermediate format) file as its input. It describes the behavior of the design in a straightforward way. An example design named "mx2" is given in Figure 5.2. The reserved word "inputs" denotes the primary inputs and "outputs" denotes the primary outputs of the system. Registers are declared with "regs" and operations are declared with "op". In the example "mx2", a1 is an input vector, yout is an output vector, and r1 is a register.
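To make the AIF format concrete, the following sketch parses the "mx2" description of Figure 5.2 and derives an ASAP schedule from it. The field layout is inferred from that one example and is an assumption: name/width pairs after "inputs", "outputs", and "regs", and "opN TYPE WIDTH source... destination" for operations. The line breaks in the string and the helper names are likewise assumptions, not part of AUDI.

```python
AIF_MX2 = """\
inputs a1 8 y1 8 a2 8 y2 8
outputs yout 8
regs r1 8 r2 8
op1 MULT 8 a1 a2 r1
op2 MULT 8 y1 y2 r2
op3 ADD 8 r1 r2 yout
end
"""


def parse_aif(text):
    """Parse the assumed AIF layout into dictionaries."""
    design = {"inputs": {}, "outputs": {}, "regs": {}, "ops": {}}
    for line in text.splitlines():
        tok = line.split()
        if not tok or tok[0] == "end":
            continue
        if tok[0] in ("inputs", "outputs", "regs"):
            # Name/width pairs follow the reserved word.
            for name, width in zip(tok[1::2], tok[2::2]):
                design[tok[0]][name] = int(width)
        elif tok[0].startswith("op"):
            name, kind, width, *operands = tok
            design["ops"][name] = {"type": kind, "width": int(width),
                                   "srcs": operands[:-1], "dst": operands[-1]}
    return design


def asap(design):
    """Earliest control step per operation; primary inputs are ready at step 0."""
    ready = {name: 0 for name in design["inputs"]}
    step = {}
    pending = dict(design["ops"])
    while pending:
        progressed = False
        for name, op in list(pending.items()):
            if all(s in ready for s in op["srcs"]):
                step[name] = 1 + max(ready[s] for s in op["srcs"])
                ready[op["dst"]] = step[name]
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("cyclic dependency or undefined operand")
    return step


design = parse_aif(AIF_MX2)
schedule = asap(design)
print(schedule)
```

Both multiplications depend only on primary inputs and land in the first step, while the addition needs r1 and r2 and lands in the second, giving a two-step schedule.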
Note that a1, yout, and r1 are all 8 bits wide. The data flow graph (DFG) representing this design is shown in Figure 5.3(a). A scheduling corresponding to this DFG is shown in Figure 5.3(b). While generating the structural VHDL netlist for the datapath, a self-explanatory header pertaining to the binding of operations to functional units and registers is also generated. The header information of the datapath corresponding to the example shown above is given in Figure 5.4. It consists of two multipliers and one adder. The interconnections are implemented by multiplexers.

    inputs a1 8 y1 8 a2 8 y2 8
    outputs yout 8
    regs r1 8 r2 8
    op1 MULT 8 a1 a2 r1
    op2 MULT 8 y1 y2 r2
    op3 ADD 8 r1 r2 yout
    end

Figure 5.2. Behavioral Description of Design "mx2".

Figure 5.3. (a) DFG Representation of "mx2". (b) A Scheduling for "mx2". (Graph labels: inputs a1, a2, y1, y2; two multiplication nodes producing r1 and r2; one addition node producing yout; time steps T1 and T2 in panel (b).)

5.2 Performance Driven Placement with High Level Synthesis

In this section, we present the details of our performance driven placement approach with high level synthesis. An RT-level VHDL description is first generated with the AUDI system. The design is synthesized using a commercial CAD tool. We estimate its timing without going through the time-consuming routing process.
Iteratively, we evaluate the design performance and provide guidance information to the high level synthesis tool to improve performance. New VHDL code is produced as design entry for the CAD tool. Finally, we run our net clustering placement algorithm proposed in Chapter 4 to place the final design and then perform reentrant routing on it. The layout information (pin-to-pin delay, wirelength, etc.) is also available for comparison.

    -- Type: Datapath
    -- CDFG statistics:
    -- Number of PI's: 4
    -- Number of PO's: 1
    -- Number of internal edges: 2
    -- Number of Operations: 3
    -- Types of Operations:
    ---- Design Flow/Algorithm Information:
    -- Scheduling: ASAP
    -- Allocation: User's Choice
    -- Binding: Automatic
    -- Interconnect style: Mulitplexor-based
    ---- Design Information:
    -- Registers: 7
    -- Functional units: 3
    -- Resource Id=0 type = MULT :
    -- Index = 0 type= MULT width = 8
    -- Mapped Ops = op1
    -- Index = 1 type= MULT width = 8
    -- Mapped Ops = op2
    -- Resource Id=1 type = ADD :
    -- Index = 0 type= ADD width = 8
    -- Mapped Ops = op3
    ---- Register Optimization Information:
    -- Register #0 (width = 8) a1
    -- Register #1 (width = 8) y1
    -- Register #2 (width = 8) a2
    -- Register #3 (width = 8) y2
    -- Register #4 (width = 8) yout
    -- Register #5 (width = 8) r1
    -- Register #6 (width = 8) r2
    ---- Controller:
    -- Type: Moore
    -- Number of states: 2
    -- Number of control bits: 7

Figure 5.4. Datapath Information.

5.2.1 Overview of the Proposed Design Flow

A high level synthesis tool has the advantage of generating various RT-level netlists quickly for functionally equivalent designs. This motivates our research interest in studying the performance driven placement problem with high level synthesis. Our proposed design flow is shown in Figure 5.5.

Figure 5.5. Overview of Design Flow. (Flowchart labels: AUDI; RTL (VHDL); Xilinx Process; Placement; Wirelength Estimation; Stop?; Net Cluster Placement; Routing; Timing Information.)
First, we generate the initial RTL netlist (VHDL) with the AUDI system. Here, we do not choose any specific scheduling algorithm and do not selectively allocate resources. Then we feed this VHDL description to Xilinx's CAD tool to perform logic synthesis. Next, a placement is performed, on which we estimate the wirelength of this design. An approximation of the circuit performance is obtained. Based on this approximation, we provide the high level synthesis tool with guidance to generate a new RTL design. AUDI chooses different scheduling algorithms and allocates various numbers of resources according to the guidance. This process is repeated until no further performance improvement is possible or the optimization objective has been achieved. Finally, we place the design with our own net clustering based placement algorithm. Reentrant routing is executed to route the design and generate the layout information.

5.2.2 Estimation of the Design Performance

Once a design is logic synthesized and placed, we can obtain a coarse performance estimation. This means time-consuming routing will not be performed in this phase. Different VHDL designs are derived by AUDI for the same behavioral description. After logic synthesis by the Xilinx CAD tool, they should contain different numbers of logic blocks and interconnections. A fast and dependable performance estimation algorithm is necessary to evaluate the current design (area, routing congestion, timing, etc.) and provide guidance for the high level synthesis tool to generate a better design over the iterations. Basically, we use Rent's rule [116] to predict interconnection. Rent's rule gives an empirical relationship between the number of pins and the number of logic blocks in a design, which tends to form a straight line in a log-log plot.
The relationship is shown in Equation 5.1:

\[ N_p = K_p N_g^{\,p} \tag{5.1} \]

Here, $N_p$ is the number of pins or the number of external signals connecting to the logic blocks, $K_p$ is a proportionality constant which is the average number of interconnections per block, $N_g$ is the number of logic blocks in a logic design, and $0 < p < 1$ is the Rent's exponent [117]. Researchers found that complex architectures are typically characterized by a Rent's exponent in the range $0.5 < p < 0.8$. In addition, systems with a regular architecture have a low Rent's exponent value. Therefore, we set $p$ to 0.5 in our work due to the regular architecture of FPGAs. Several works have been proposed based on Rent's rule to predict interconnection and routability. Van Marck et al. [118] proposed a technique to describe local variations in interconnection complexity. It fits well for algorithms such as VPR [87], which uses a linear wirelength as its cost function. Li and Banerji [119] derived a statistical model for predicting routability for hierarchical FPGAs prior to placement. Brown et al. [120] proposed a stochastic model for estimating the channel densities for FPGA routing architectures. It predicts routing resource requirements in the post-placement stage of design. In our approach, we adopt the method proposed by Buyuksahin and Najm [121] to estimate the average wirelength. It was shown in [121] that the average interconnect length estimation error is 14.4%, which is acceptable in predicting the actual delay. By integrating Rent's rule in our estimation approach, we are able to predict the interconnection delay consistently. This helps us to provide meaningful and instructive guidance for our high level synthesis tool. Hence, AUDI is able to generate a new VHDL netlist which is highly likely to improve the system performance.
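As a small worked example of Equation 5.1, the sketch below evaluates Rent's rule with $p = 0.5$ and shows how $K_p$ and $p$ could be recovered from data by fitting a straight line in log-log space. The value $K_p = 4$ and the block counts are hypothetical and chosen only to keep the arithmetic easy to follow; the fitting helper is not part of the estimation method of [121].

```python
import math


def rent_pins(n_blocks, k_p, p=0.5):
    """Equation 5.1: N_p = K_p * N_g ** p."""
    return k_p * n_blocks ** p


def fit_rent(samples):
    """Least-squares line through (log N_g, log N_p); returns (K_p, p)."""
    xs = [math.log(n_g) for n_g, _ in samples]
    ys = [math.log(n_p) for _, n_p in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), slope


# Hypothetical design with 400 logic blocks and K_p = 4: 4 * sqrt(400) pins.
pins = rent_pins(400, k_p=4.0)

# Synthetic samples generated from the same law, then fitted back.
samples = [(n, rent_pins(n, 4.0)) for n in (16, 64, 256, 1024)]
k_fit, p_fit = fit_rent(samples)
print(pins, k_fit, p_fit)
```

Because the rule is a power law, the slope of the fitted line is the Rent's exponent and the intercept gives $K_p$, which is why the relationship appears as a straight line in a log-log plot.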
5.2.3 Iterative Design Space Search

Once a placement is computed, we do not actually route it before we are confident that the design performance will meet our design objective. This is mainly due to the fact that for CAD tools the majority of the design time is spent on performing routing. Since a high level synthesis tool is capable of generating various RTL netlists for the same design very efficiently, we would like to find the best netlist in a certain solution space before we finally place and route it. The best netlist is the one that can be physically synthesized to yield the best performance among a set of netlists. During high level synthesis, we can choose different scheduling algorithms and assign various numbers of resources such as adders, multipliers, multiplexers, etc. This leads to a very large solution space. Once we start searching in this space, we would like to find the correct search direction quickly, which means that the algorithm converges fast. While probing the design space, our approach works incrementally. For example, assume we are trying to determine the number of one particular type of resource to be allocated. We first assign the minimal, the maximal, and the median number of resources to various designs. With the feedback derived from the interconnect estimation program, we know in which range the number of allocated resources is preferred. Then the high level synthesis tool keeps increasing or decreasing the resources allocated as long as doing so yields a better estimation. To avoid a locally optimal solution, we also allow searching in the opposite direction at a smaller rate. For instance, after searching in one direction 10 times, we try the other direction once. This can be done by keeping a record of the start point of the current searching process.
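The probing strategy for a single resource type can be sketched as follows. This is one possible reading of the description, not the AUDI implementation: `estimate` stands for the synthesize, place, and estimate loop and returns a cost where lower is better, and the parameters `patience`, `max_iters`, and `reverse_every`, as well as the toy cost function, are hypothetical.

```python
def search_resource_count(estimate, lo, hi, target=None,
                          max_iters=50, patience=5, reverse_every=10):
    """Search the number of units of one resource type in [lo, hi]."""
    mid = (lo + hi) // 2
    # Probe the minimal, median, and maximal allocations first.
    cache = {n: estimate(n) for n in (lo, mid, hi)}
    start = min(cache, key=cache.get)
    best_n, best_cost = start, cache[start]
    # Head away from the worse end of the range.
    if start == lo:
        step = 1
    elif start == hi:
        step = -1
    else:
        step = 1 if cache[hi] < cache[lo] else -1
    fwd = back = start  # both cursors remember the start point
    stale = 0
    for it in range(1, max_iters + 1):
        if target is not None and best_cost <= target:
            break  # design objective met
        if stale >= patience:
            break  # no improvement over several consecutive searches
        if it % reverse_every == 0:
            back -= step  # occasional probe in the opposite direction
            n = back
        else:
            fwd += step
            n = fwd
        if not lo <= n <= hi:
            stale += 1
            continue
        if n not in cache:
            cache[n] = estimate(n)
        if cache[n] < best_cost:
            best_n, best_cost, stale = n, cache[n], 0
        else:
            stale += 1
    return best_n, best_cost


def toy_cost(n):
    """Hypothetical estimated delay with a minimum at six units."""
    return (n - 6) ** 2 + 10


best = search_resource_count(toy_cost, lo=1, hi=16)
print(best)
```

The best design seen so far is always kept, so a worse probe never replaces it, and the three exit conditions in the loop (target met, no recent improvement, iteration cap) bound the number of synthesis runs.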
Therefore, our design space search approach is non-greedy in nature. Meanwhile, we store the best design generated so far. Note that a newly generated design will not necessarily outperform the best design, as the overall performance is affected by other factors as well, e.g., the number of other resources allocated, the scheduling algorithm selected, etc. But through this training process, we are able to guide the high level synthesis tool to search towards the better direction overall. This iterative process terminates once one of the following situations occurs:

• The estimation shows that the best synthesized design could satisfy the original system design objective;
• There is no further improvement after a certain number of consecutive searches, which implies that further performance gain is unlikely;
• A predefined number of iterations have been tested. This reduces the running time effectively.

5.3 Experimental Results

We have developed our proposed high level synthesis flow for performance driven placement. First, our high level synthesis suite (AUDI) takes a behavioral data flow graph (DFG) representation as its input. An RTL design in VHDL format is generated without specifically selecting functional resources. This design is fed into the Xilinx ISE CAD tool, which carries out logic optimization and physical synthesis. Once the circuit is placed, we estimate its performance in terms of timing, routing congestion, and area. Based on the performance of the current design, guidance is generated for the AUDI system to generate the next design. The objective is to search the design space towards a better direction. Basically, different numbers of resources (adders, subtractors, multiplexers, etc.) can be allocated and different scheduling algorithms (ASAP, ALAP, FDS, etc.) can be used.
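The search procedure of Section 5.2.3, together with its three stopping conditions, can be sketched roughly as follows for a single resource type. This is only a simplified illustration under stated assumptions, not the AUDI implementation: the names `search_allocation`, `estimate`, `patience`, and `reverse_every` are hypothetical, the step size is fixed at one resource, and `estimate` stands in for the interconnect estimation program (lower values are better).

```python
from typing import Callable, Tuple


def search_allocation(
    estimate: Callable[[int], float],
    lo: int,
    hi: int,
    target: float,
    patience: int = 5,
    max_iters: int = 50,
    reverse_every: int = 10,
) -> Tuple[int, float, str]:
    """Return (best resource count, best estimated delay, stop reason)."""
    # Probe the minimal, median, and maximal allocations first.
    start = (lo + hi) // 2
    scores = {n: estimate(n) for n in sorted({lo, start, hi})}
    best_n = min(scores, key=scores.get)
    best = scores[best_n]

    # Head towards the end of the range that gave the better estimate.
    direction = 1 if scores[hi] < scores[lo] else -1
    forward = backward = start  # 'start' is the recorded start point
    stale = 0                   # consecutive searches without improvement

    for it in range(1, max_iters + 1):
        if best <= target:
            return best_n, best, "target met"
        if stale >= patience:
            return best_n, best, "no improvement"
        if it % reverse_every == 0:
            # Occasionally step the other way from the start point.
            backward -= direction
            n = backward
        else:
            forward += direction
            n = forward
        if not lo <= n <= hi:
            stale += 1
            continue
        if n not in scores:
            scores[n] = estimate(n)
        if scores[n] < best:
            best_n, best, stale = n, scores[n], 0
        else:
            stale += 1
    return best_n, best, "iteration limit"


# Toy cost with its minimum at 7 resources (purely illustrative).
print(search_allocation(lambda n: (n - 7) ** 2 + 3.0, lo=1, hi=16, target=0.0))
```

The best design seen so far is kept separately from the current search position, so a worse candidate never overwrites it, which mirrors the non-greedy behavior described in Section 5.2.3.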
This process is repeated until a predefined design target is met or no further performance improvement is possible. Once this iterative process is over, the finalized design is again given to Xilinx ISE for mapping. We run our own net clustering algorithm to place the mapped circuit, and finally a re-entrant routing is carried out using Xilinx ISE to generate timing information.

We have tested our proposed design flow over a set of behavioral benchmarks which can be taken by the AUDI system. The characteristics of the benchmarks used are shown in Table 5.1. This table lists the number of primary inputs (PIs), the number of primary outputs (POs), the number of internal registers used, and the total number of operations of each design. Operations include MULT, ADD, SUB, etc. Note that all PIs and POs are vectors and their widths are given in the column "Width".

Table 5.1. Description of Behavioral Benchmarks for AUDI System.

Design   # PIs   # POs   # Registers   # Operations   Width
iir      10      1       8             9              8
ewf      9       7       26            33             4
fir      10      1       8             9              8
fft4     6       4       6             10             8
latt     8       3       10            13             8
dct8ip   24      8       40            48             4

The target device we chose is the Xilinx Virtex v800fg680, and we used the default settings to run the Xilinx ISE tool to map, place, and route the initial designs. Once we have our own placed circuits, we use Xilinx ISE again to run re-entrant routing with the same default settings. The experimental results are given in Table 5.2, where Wire denotes the average wire delay and Delay denotes the maximum pin-to-pin delay. Designs generated by the proposed design flow outperform the corresponding initial designs significantly. On average, we achieve a 24.7% reduction in maximum delay. Meanwhile, we are able to reduce the average wire delay by 11.5%.
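The improvement figures in Table 5.2 are relative reductions with respect to the initial design. A minimal sketch of that computation is given below; the helper name `reduction_pct` is an assumption introduced for this example, and the sample values are the "iir" and "ewf" entries of Table 5.2.

```python
def reduction_pct(initial: float, final: float) -> float:
    """Relative reduction of `final` with respect to `initial`, in percent.

    A negative result means the value increased.
    """
    if initial <= 0:
        raise ValueError("initial value must be positive")
    return 100.0 * (initial - final) / initial


# Design "iir": average wire delay and maximum pin-to-pin delay (ns).
print(round(reduction_pct(2.873, 2.465), 1))
print(round(reduction_pct(9.318, 7.443), 1))
# Design "ewf": the average wire delay grows, so the result is negative.
print(round(reduction_pct(2.552, 2.737), 1))
```

A negative value in the Wire column therefore marks a design whose average wire delay increased even though its maximum delay improved.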
The biggest gain is for the design "dct8ip", where we reduce the critical delay by 49.7% and the average wire delay by 38.1%. Note that for two designs (ewf and fft4), the average wire delay increases but we still reduce the longest path delay. We also note that for the largest benchmarks, we get the highest percentage of performance improvement. This leads us to believe with great confidence that for bigger designs, which have a correspondingly larger solution space, our proposed approach has more room for improvement. Therefore, the combination of high level synthesis and physical synthesis has excellent potential to outperform a purely physical synthesis design scheme.

Table 5.2. High Level Synthesis for Performance Driven Placement.

          Initial Design           Final Design             Improvement (%)
Design    Wire (ns)   Delay (ns)   Wire (ns)   Delay (ns)   Wire    Delay
iir       2.873       9.318        2.465       7.443        14.2    20.1
ewf       2.552       9.384        2.737       6.971        -7.2    25.7
fir       2.664       7.508        2.341       5.873        12.1    21.8
fft4      2.414       9.929        2.477       8.107        -2.6    18.4
latt      2.661       8.369        2.270       7.254        14.6    13.3
dct8ip    2.743       8.982        1.699       4.464        38.1    49.7
Average                                                     11.5    24.8

To verify the accuracy of our delay estimation scheme, we also ran a series of experiments on one particular design, "latt". The procedure is as follows. After a placement is generated and the estimated delay is computed, we proceed to route the design to get the actual delay information. However, the feedback given to AUDI is still generated from the estimated delay information. If the entire process runs for N iterations, we finally have N estimated delay values and N actual delay values. Note that we route the design in every iteration only in order to assess the accuracy of the estimation. This is not the case when we actually run our experimental flow.
In fact, we route the design only once at the end, when we are satisfied with the performance of the final design. We show the estimated and actual delay values for design "latt" in Table 5.3. On average, the estimation error of the average interconnect delay is 11.5%. This accuracy is acceptable and hence it helps us to predict the performance of the RTL netlist generated by AUDI. We also plot the data of Table 5.3 in Figure 5.6 to show that our space search approach converges quickly. We can also see that our proposed approach is non-greedy in nature, as in some cases we actually generate a netlist which results in worse performance. This is useful in that our algorithm can jump out of local optima.

Because we have to run the entire flow in a cross-platform fashion, i.e., running Xilinx in a Windows system and AUDI in Linux, we do not report the run time for the benchmarks. However, excluding the time taken by the Xilinx tools to parse, logic synthesize, and place each design, it usually takes less than 30 seconds to generate a new RTL netlist in each iteration.

Table 5.3. Delay Estimation for "latt".

Iteration   Estimated Delay (ns)   Actual Delay (ns)   Error (%)
1           3.46                   4.06                17.3
2           3.38                   3.85                13.9
3           3.14                   3.42                8.9
4           3.02                   2.75                9.8
5           2.61                   2.89                10.7
6           2.57                   2.82                9.7
7           2.58                   2.31                11.7
8           2.48                   2.27                9.2
Average                                                11.5

5.4 Conclusions and Future Work

In this chapter, we have proposed a performance driven placement design flow with high level synthesis. It is an iterative search approach which converges very quickly. Given a behavioral description of a design, our high level synthesis tool (i.e., AUDI) generates an RTL netlist which is given as input to the Xilinx logic and physical design tool. Once the placement is computed, a performance estimation algorithm evaluates the performance (area, routing congestion, timing, etc.) of this particular design.
Feedback is given to the AUDI system for generating the next design towards a better direction. With this guidance, the high level synthesis tool converges quickly and ends up with a design capable of improving the overall performance after placement and routing. Compared with the Xilinx CAD tool, our approach improves the critical pin-to-pin delay significantly. This gives us confidence that the combination of a high level synthesis tool and a physical design tool has good potential to enhance the performance of modern VLSI designs.

[Figure 5.6. Delay Estimation and Cost Convergence for "latt". Plot of estimated delay and actual delay; x-axis: iterations; y-axis: average delay (ns).]

The major contributions of our research include the following:

• The interaction between high level synthesis and physical synthesis is unique. It gathers physical level information and directs the high level synthesis tool in generating new designs very effectively.
• Our performance estimation tool is able to evaluate the performance of the current design with consistent accuracy. Hence, it provides instructive information to the HLS tool to search in the correct direction.
• Our net clustering placement algorithm is used effectively to further improve the performance by generating timing-driven placement solutions.

In our future research work, we would like to extend this in the following directions:

• Currently we do not have much access to the router. Hence, gaining more knowledge of the router will be very beneficial for predicting the timing more accurately.
• Add functionality to our high level synthesis tool so that it has layout information at hand once an RTL design is generated. This can be done by communicating with libraries developed for different target devices, so that the physical level netlist is available after the RTL netlist is generated.
  This can expedite the synthesis process.

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

In this dissertation, we study logic synthesis and physical synthesis for FPGAs. As power consumption has become a major concern in VLSI design, precautions should be taken in the early design phase. Several low power driven technology mapping algorithms are proposed in this work for FPGAs. In Chapter 3, we present network flow based approaches to reduce dynamic power consumption. Switching activity at the output of a LUT affects the power consumed by this LUT; hence, intuitively, we would like to minimize the total switching activity of the mapped circuit. However, this strategy does not always work out. Our work differs from other research in that, while computing a subnetwork LUT covering, we consider its effect on the overall mapping solution. We choose the one that yields the least power consumption. In addition, we develop a very efficient incremental network flow computation method which expedites the process of finding the LUT covering. We achieve dramatic power savings as well as area reduction over previous LUT-based FPGA low power technology mapping algorithms. This work is further extended to handle delay-optimal low power mapping. Without increasing the optimal delay of the network, we map LUTs on non-critical paths as long as a power reduction is obtained. Compared with a well-known simultaneous delay and area optimization mapping algorithm, our approach achieves significant power reduction without increasing the optimal delay of the mapping solution.

Because interconnect dominates the delay in the current deep submicron regime, improving timing has become a necessity in every phase of physical design.
Placement is of great importance because it defines the on-chip interconnects, which have become the bottleneck in enhancing the overall timing. We propose a net clustering based placement algorithm which works very well for net concentric modern designs. In Chapter 4, a force-directed net cluster level placement approach is presented. Each net cluster denotes a set of strongly connected nets which should be routed in the same region. A net cluster floorplanning is carried out using Simulated Annealing to optimize the total wirelength. Next, a coarse net placement is computed using the force-directed method. Then we find the detailed cell level placement, again using the force-directed approach. Finally, a novel maximum weighted bipartite matching method is developed to match the I/O pins. This method is very effective because it gives priority to cells on the critical path, which reduces the critical path delay. In addition, it reduces the bipartite matching problem size, which in turn reduces the running time dramatically. Our proposed algorithm leads to excellent post-layout delay improvement.

In Chapter 5, we present a design flow that combines high level synthesis and net clustering placement for FPGAs. High level synthesis is capable of generating various behavioral descriptions (e.g., VHDL) for the same design. Once we have a design mapped and placed with a CAD tool, we estimate the total wirelength, which gives an approximation of the timing without actually routing the design. If we see improvement over the iteration, we guide the high level synthesis tool to generate another design accordingly. Finally, we place the design with our own net clustering placement algorithm. Experimental results show significant critical pin-to-pin delay improvement over a set of behavioral benchmarks.
The work presented in this dissertation can be further improved or continued in the following areas:

• Since multiple power supplies have become common in state-of-the-art FPGA chips, the low power technology mapping problem should be studied considering the effect of different voltages. Taking a dual-voltage FPGA as an example, logic components with high switching activities can be powered using the lower supply voltage in order to save power. When we take delay into account, logic components on the critical path should preferably be powered using the higher power supply. For non-critical components, the lower power supply can be used without penalizing the overall performance.
• With design complexity having reached multi-million blocks on a single chip, the compile time has become a very important factor in evaluating the performance of a CAD tool. It will be of great interest to develop an ultra-fast placement algorithm without deteriorating the performance too much. A quadratic programming based approach is the most promising way to handle this problem.
• High level synthesis tools are not yet popular in the EDA industry, mainly because they fail to provide dependable layout prediction. Due to its capability of probing the solution space efficiently and quickly, high level synthesis is very important for modern large scale designs. Hence, the precise correlation between high level synthesis and layout estimation should draw a large amount of research interest in the years to come. A future CAD tool should contain both a high level synthesis tool and a physical synthesis tool, and the two are expected to work jointly.

REFERENCES

[1] Z.H. Wang, E.C. Liu, J. Lai, and T.C. Wang. "Power Minimization in LUT-Based FPGA Technology Mapping". In Proceedings of ASP-DAC, pages 635-640, 2001.
[2] A. H.
Farrahi and M. Sarrafzadeh. "FPGA Technology Mapping for Power Minimization". In 4th International Workshop on Field Programmable Logic and Applications, pages 66-77, September 1994.
[3] G. E. Moore. "Cramming more components onto integrated circuits". Electronics, 38(8):114-117, April 1965.
[4] Naveed Sherwani. "Algorithms for VLSI Physical Design Automation". Kluwer Academic Publishers, 1999.
[5] S. Brown, R. Francis, J. Rose, and Z. Vranesic. "Field-Programmable Gate Arrays". Kluwer Academic Publishers, 1992.
[6] Xilinx Inc. "The Programmable Logic Data Book". San Jose, CA, 1999.
[7] D. Hill. "A CAD System for the Design of Field Programmable Gate Arrays". In Proceedings ACM/IEEE Design Automation Conference, pages 187-192, 1991.
[8] Actel Corporation. "Accelerator Series FPGAs ACT3 Family". 1997.
[9] Actel Corporation. "SX Family of High Performance FPGAs". 2001.
[10] Xilinx Inc. "Virtex 2.5 V Field Programmable Gate Arrays". 1998.
[11] S. D. Brown, R. J. Francis, J. Rose, and Z. G. Vranesic. "Field Programmable Gate Arrays". Kluwer Academic Publishers, 1995.
[12] N. Weste and K. Eshraghian. "Principles of CMOS VLSI Design: A System Perspective". Addison-Wesley, Reading, 1993.
[13] M. C. McFarland, A. C. Parker, and R. Camposano. "Tutorial on High-Level Synthesis". In Proceedings 25th ACM/IEEE Design Automation Conference, pages 330-336, 1988.
[14] R. Camposano and W. Wolf. "High-Level VLSI Synthesis". Kluwer Academic Publishers, 1991.
[15] S. Haynal and F. Brewer. "Automata-based symbolic scheduling for looping DFGs". IEEE Transactions on Computers, 50(3):250-267, March 2001.
[16] R. Camposano. "Path-based scheduling for synthesis". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 10(1):85-93, January 1991.
[17] P. G. Paulin and J. P. Knight. "Algorithms for high-level synthesis".
IEEE Design and Test of Computers, 6(6):18-31, December 1999.
[18] E. Musoll and J. Cortadella. "Scheduling and resource binding for low power". In Proceedings 8th International Symposium on System Synthesis, pages 104-109, 1995.
[19] S. P. Mohanty and N. Ranganathan. "A framework for energy and transient power reduction during behavioral synthesis". IEEE Transactions on VLSI Systems, 12(6):562-572, June 2004.
[20] S. P. Mohanty, N. Ranganathan, and S. K. Chappidi. "ILP models for energy and transient power minimization during behavioral synthesis". In Proceedings 17th International Conference on VLSI Design, pages 745-748, January 2004.
[21] C. Park. "Task Scheduling in High Level Synthesis". PhD thesis, University of Illinois at Urbana-Champaign, 1996.
[22] Altera Corp., San Jose, CA. "Programmable Logic Devices Data Book and Design Guide". 1994.
[23] AT&T Microelectronics. "AT&T Field-Programmable Gate Arrays Data Book". AT&T Corp., Berkeley Heights, NJ, 1995.
[24] Actel Corporation. "FPGA Data Book and Design Guide". 1994.
[25] B. W. Kernighan and S. Lin. "An efficient heuristic procedure for partitioning graphs". Bell Systems Technical Journal, 49:291-307, 1970.
[26] C. M. Fiduccia and R. M. Mattheyses. "A linear-time heuristic for improving network partitions". In Proceedings 19th ACM/IEEE Design Automation Conference, pages 175-181, 1982.
[27] L. A. Sanchis. "Multiple-way network partitioning". IEEE Transactions on Computers, pages 62-81, 1989.
[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. "Optimization by simulated annealing". Science, 220(4598):671-680, May 1983.
[29] H. B. Bakoglu. "Circuits, Interconnections, and Packaging for VLSI". Addison-Wesley, 1990.
[30] Y.T. Lai, K.R. R. Pan, and M. Pedram. "FPGA synthesis using function decomposition". In Proceedings IEEE International Conference on Computer Design, pages 30-35, 1994.
[31] B. Wurth, K.
Eckl, and K. Antreich. "Functional multiple-output decomposition". In Proceedings ACM/IEEE Design Automation Conference, pages 54-59, 1995.
[32] C. Bhat and N. N. Chiplunkar. "Routability-Driven Technology Mapping for Look-Up Table-Based FPGAs". In Proceedings 12th International Conference on VLSI Design, pages 390-393, 1999.
[33] A. H. Farrahi and M. Sarrafzadeh. "Complexity of the Lookup-Table Minimization Problem for FPGA Technology Mapping". IEEE Transactions on Computer-Aided Design, pages 1319-1332, November 1994.
[34] R. J. Francis, J. Rose, and K. Chung. "Chortle: A Technology Mapping for Lookup Table-Based Field Programmable Gate Arrays". In Proceedings 27th ACM/IEEE Design Automation Conference, pages 613-619, June 1990.
[35] R. J. Francis, J. Rose, and Z. Vranesic. "Chortle-crf: Fast Technology Mapping for Lookup Table-Based FPGAs". In Proceedings 28th ACM/IEEE Design Automation Conference, pages 227-233, June 1991.
[36] K. Karplus. "Xmap: a Technology Mapper for Table-lookup Field-Programmable Gate Arrays". In Proceedings 28th ACM/IEEE Design Automation Conference, pages 240-243, June 1991.
[37] Y.T. Lai, M. Pedram, and S. Sastry. "BDD based decomposition of logic functions with application to FPGA synthesis". In Proceedings 30th ACM/IEEE Design Automation Conference, pages 230-235, June 1993.
[38] K.C. Chen, J. Cong, Y. Ding, A. Kahng, and P. Trajmar. "DAG-Map: Graph-based FPGA Technology Mapping for Delay Optimization". In IEEE Design and Test of Computers, pages 7-20, 1992.
[39] J. Cong and Y. Ding. "An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup Table Based FPGA Designs". In International Conference on Computer Aided Design, pages 213-218, November 1992.
[40] R. Murgai, N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. "Improved Logic Synthesis Algorithms for Table Look Up Architectures".
In Proceedings IEEE International Conference on Computer Aided Design, pages 564-567, November 1991.
[41] P. Sawkar and D. E. Thomas. "Technology Mapping for Table-Look-Up Based Field Programmable Gate Arrays". In ACM/SIGDA Workshop on Field Programmable Gate Arrays, pages 83-88, February 1992.
[42] M. Schlag, J. Kong, and P. K. Chan. "Routability-Driven Technology Mapping for Look-Up Table-Based FPGAs". In IEEE International Conference on Computer Design, pages 86-90, October 1992.
[43] N. Togawa, M. Sato, and T. Ohtsuki. "Maple: A Simultaneous Technology Mapping, Placement and Global Routing Algorithm for Field-Programmable Gate Arrays". In Proceedings International Conference on Computer Aided Design, pages 155-163, 1994.
[44] J. Cong and Y. Ding. "On Area/Depth Trade-Off in LUT-based FPGA Technology Mapping". IEEE Transactions on VLSI Systems, 2(2):137-148, June 1994.
[45] J. Cong and Y.Y. Hwang. "Simultaneous Depth and Area Minimization in LUT-Based FPGA Mapping". In Proceedings International Symposium on Field Programmable Gate Arrays, pages 68-74, 1995.
[46] J.D. Huang, J.Y. Jou, and W.Z. Shen. "An Iterative Area/Performance Trade-off Algorithm for LUT-Based FPGA Technology Mapping". In Proceedings International Conference on Computer-Aided Design, pages 13-17, 1996.
[47] R. J. Francis, J. Rose, and Z. Vranesic. "Technology mapping of lookup table-based FPGAs for performance". In Proceedings IEEE International Conference on Computer-Aided Design, pages 568-571, November 1991.
[48] E. L. Lawler, K. N. Levitt, and J. Turner. "Module clustering to minimize delay in digital networks". IEEE Transactions on Computers, 18:47-57, 1969.
[49] J. Cong and Y. Ding. "On nominal delay minimization in LUT-based FPGA technology mapping". Integration, the VLSI Journal, 18:73-94, 1994.
[50] H. Yang and D. Wong.
"Edge-Map: Optimal performance driven technology mapping for iterative LUT based FPGA designs". In Proceedings IEEE/ACM International Conference on Computer-Aided Design, pages 150-155, 1994.
[51] R. Murgai, N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. "Performance directed Synthesis for Table Look Up programmable gate arrays". In Proceedings IEEE International Conference on Computer Aided Design, pages 572-575, November 1991.
[52] P. Sawkar and D. E. Thomas. "Performance directed technology mapping for lookup table based FPGAs". In Proceedings ACM/IEEE Design Automation Conference, pages 208-212, June 1993.
[53] J. Cong and Y. Ding. "Beyond the combinational limit in depth minimization for LUT-based FPGA designs". In Proceedings IEEE International Conference on Computer-Aided Design, pages 110-114, 1993.
[54] R. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang. "MIS: A Multiple-Level Logic Optimization System". IEEE Transactions on Computer-Aided Design, 6(6):1062-1081, 1987.
[55] I. Levin and R. Y. Pinter. "Realizing expression graphs using table-lookup FPGAs". In Proceedings European Design Automation Conference, pages 306-311, September 1993.
[56] N. S. Woo. "A heuristic method for FPGA technology based on the edge visibility". In Proceedings ACM/IEEE Design Automation Conference, pages 248-251, June 1991.
[57] R. Murgai, Y. Nishizaki, N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. "Logic synthesis algorithms for programmable gate arrays". In Proceedings ACM/IEEE Design Automation Conference, pages 620-625, June 1990.
[58] A. Chowdhary and J. P. Hayes. "Technology mapping for field-programmable gate arrays using integer programming". In Proceedings IEEE International Conference on Computer-Aided Design, pages 361-367, November 1995.
[59] V. Kommu and I. Pomeranz.
"GAFPGA: Genetic algorithm for FPGA technology mapping". In Proceedings European Design Automation Conference, pages 300-305, September 1993.
[60] J. Cong and Y. Ding. "An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs". In Proceedings IEEE International Conference on Computer-Aided Design, pages 48-53, 1992.
[61] H. Li, W.K. Mak, and S. Katkoori. "LUT-Based FPGA Technology Mapping for Power Minimization with Optimal Depth". In Proceedings IEEE-CS Workshop on VLSI, pages 123-128, 2001.
[62] H. Li, W.K. Mak, and S. Katkoori. "An Efficient LUT-Based FPGA Technology Mapping Algorithm for Power Minimization". In Proceedings Asia and South Pacific Design Automation Conference, pages 353-358, 2003.
[63] H. Li, S. Katkoori, and W.K. Mak. "Power Minimization Algorithms for LUT based FPGA Technology Mapping". ACM Transactions on Design Automation of Electronic Systems (TODAES), 9(1):33-51, January 2004.
[64] P. Villarrubia. "Important placement considerations for modern VLSI chips". In Proceedings International Symposium on Physical Design, page 6, 2003.
[65] S. M. Sait and H. Youssef. "VLSI Physical Design Automation: Theory and Practice". World Scientific Publishing, 1999.
[66] J. Kleinhans, G. Sigl, F. Johannes, and K. Antreich. "Gordian: VLSI placement by quadratic programming and slicing optimization". IEEE Transactions on Computer-Aided Design, pages 356-365, March 1991.
[67] M. Wang, X. Yang, and M. Sarrafzadeh. "Dragon2000: Standard-cell placement tool for large industry circuits". In Proceedings ACM/IEEE International Conference on Computer-Aided Design, pages 260-263, 2002.
[68] W. Sun and C. Sechen. "Efficient and effective placement for very large circuits". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 349-359, March 1995.
[69] A. E. Caldwell, A. B. Kahng, and I. L. Markov.
"Can Recursive Bisection Alone Produce Routable Placements?". In Proceedings ACM/IEEE Design Automation Conference, pages 477-482, 2000.
[70] C.C. Chang, J. Cong, Z. Pan, and X. Yuan. "Physical hierarchy generation with routing congestion control". In Proceedings International Symposium on Physical Design, pages 36-41, 2002.
[71] W. Swartz and C. Sechen. "Timing driven placement for large standard cell circuits". In Proceedings ACM/IEEE Design Automation Conference, pages 211-215, 1995.
[72] T. Hamada, C.K. Cheng, and P.M. Chau. "Prime: A timing-driven placement tool using a piecewise linear resistive network approach". In Proceedings of the IEEE/ACM Design Automation Conference, pages 531-536, June 1993.
[73] T. Kong. "A novel net weighting algorithm for timing-driven placement". In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pages 172-176, 2002.
[74] M. Hutton, K. Adibsamii, and A. Leaver. "Timing-driven placement for hierarchical programmable logic devices". In ACM Symposium on FPGAs, pages 3-11, 2001.
[75] C. Y. Lee. "An algorithm for path connection and its application". IRE Transactions on Electronic Computers, EC-10, 1961.
[76] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. "Introduction to Algorithms, second edition". The MIT Press, 2001.
[77] M. A. Breuer. "A class of min-cut placement algorithms". In Proceedings Design Automation Conference, pages 284-290, 1977.
[78] M. A. Breuer. "Min-cut placement". Journal of Design Automation and Fault-Tolerant Computing, pages 343-382, October 1977.
[79] A. Dunlop and B. W. Kernighan. "A procedure for placement of standard-cell VLSI circuits". IEEE Transactions on Computer Aided Design, pages 92-98, 1985.
[80] D. Huang and A. Kahng. "Partitioning-based standard-cell global placement with an exact objective". In ACM Symposium on Physical Design, pages 18-25, 1997.
[81] T.
Chan, J. Cong, T. Kong, and J. Shinnerl. "Multilevel optimization for large-scale circuit placement". In Proceedings IEEE/ACM International Conference on Computer-Aided Design, pages 171-176, 2000.
[82] K. M. Hall. "An r-dimensional quadratic placement algorithm". Management Science, 17:219-229, 1970.
[83] A. Srinivasan, K. Chaudhary, and E. S. Kuh. "RITUAL: Performance driven placement algorithm of small-cell ICs". In Proceedings ACM/IEEE International Conference on Computer Aided Design, pages 48-51, 1991.
[84] J. Vygen. "Algorithms for large-scale flat placement". In Proceedings ACM/IEEE Design Automation Conference, pages 746-751, 1997.
[85] H. Eisenmann and F. M. Johannes. "Generic global placement and floorplanning". In Proceedings ACM/IEEE Design Automation Conference, pages 269-274, June 1998.
[86] C. Sechen and A.L. Sangiovanni-Vincentelli. "Timberwolf 3.2: A new standard cell placement and global routing package". In Proceedings ACM/IEEE Design Automation Conference, pages 432-439, 1986.
[87] V. Betz and J. Rose. "VPR: A new packing, placement and routing tool for FPGA research". In Proceedings International Workshop on Field-Programmable Logic and Applications, pages 213-222, 1997.
[88] B. Hu and M. Marek-Sadowska. "Fine granularity clustering for large scale placement problems". In Proceedings International Symposium on Physical Design, pages 67-74, 2003.
[89] J. Cong, M. Romesis, and M. Xie. "Optimality, scalability and stability study of partitioning and placement algorithms". In International Symposium on Physical Design, pages 88-94, 2003.
[90] V. George and J. M. Rabaey. "Low-Energy FPGAs". Kluwer Academic Publishers, 2001.
[91] C.C. Wang and C.P. Kwan. "Low Power Technology Mapping by Hiding High-transition Paths in Invisible Edges for LUT-based FPGAs". In IEEE International Symposium on Circuits and Systems, pages 1536-1539, 1997.
[92] F. Najm.
"Transition Density: A New Measure of Activity in Digital Circuits". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 310-323, February 1993.
[93] Altera Corp. "Device Data Book". 1999.
[94] A. Marquardt, V. Betz, and J. Rose. "Timing-driven placement for FPGAs". In Proceedings ACM Symposium on FPGAs, pages 203-213, 2002.
[95] S. Nag and R. Rutenbar. "Performance-driven simultaneous place and route for row-based FPGAs". In International Conference on Computer-Aided Design, 1995.
[96] S. Alupoaei and S. Katkoori. "Net-based force-directed macrocell placement for wirelength optimization". Journal of VLSI Signal Processing Systems, pages 151-163, May 2004.
[97] C. Tseng and D. P. Siewiorek. "Facet: a procedure for the automated synthesis of digital systems". In Proceedings ACM/IEEE Design Automation Conference, pages 490-496, 1988.
[98] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani. "VLSI module placement based on rectangle-packing by the sequence-pair". IEEE Transactions on Computer Aided Design, pages 1518-1524, December 1996.
[99] M. Huang, F. Romeo, and A. Sangiovanni-Vincentelli. "An Efficient General Cooling Schedule for Simulated Annealing". In Proceedings International Conference on Computer-Aided Design, pages 381-384, 1986.
[100] N. R. Quinn and M. A. Breuer. "A forced directed component placement procedure for printed circuit boards". IEEE Transactions on Circuits and Systems, pages 377-388, 1979.
[101] R. Kress. "Numerical Analysis". Springer-Verlag New York Inc., 1998.
[102] F. Mo, A. Tabbara, and R. K. Brayton. "A force-directed macro-cell placer". In Proceedings International Conference on Computer Aided Design, pages 177-180, 2000.
[103] H. W. Kuhn. "The Hungarian method for the assignment problem". Naval Research Logistics Quarterly, 2(1):83-97, 1955.
[104] H. Li, S. Katkoori, and W.K. Mak.
"Force-directed performance driven placement algorithm for FPGAs". In Proceedings ISVLSI 2004, pages 193-198, 2004.
[105] A. K. Murugavel and N. Ranganathan. "Gate Sizing and Buffer Insertion using Economic Models for Power Optimization". In International Conference on VLSI Design, pages 195-200, 2004.
[106] S. P. Mohanty, N. Ranganathan, and S. K. Chappidi. "Power Fluctuation Minimization During Behavioral Synthesis using ILP-Based Datapath Scheduling". In International Conference on Computer Design, pages 441-443, 2003.
[107] C. Gopalakrishnan and S. Katkoori. "Tabu Search Based Behavioral Synthesis of Low Leakage Datapaths". In IEEE Symposium on VLSI, pages 260-261, 2004.
[108] C. Gopalakrishnan and S. Katkoori. "KnapBind: An Area-Efficient Binding Algorithm for Low-leakage Datapaths". In International Conference on Computer Design, pages 430-435, 2003.
[109] M. Xu and F. J. Kurdahi. "Layout-driven RTL binding techniques for high-level synthesis using accurate estimators". ACM Transactions on Design Automation of Electronic Systems, 2(4):312-343, 1997.
[110] M. Xu and F. J. Kurdahi. "Layout-Driven High Level Synthesis for FPGA Based Architectures". In Proceedings Design Automation and Test in Europe, pages 446-450, 1998.
[111] D. Kim, J. Jung, S. Lee, J. Jeon, and K. Choi. "Behavior-to-placed RTL synthesis with performance-driven placement". In Proceedings IEEE/ACM International Conference on Computer-Aided Design, pages 320-325, 2001.
[112] University of South Florida. "AUDI AUtomatic Design Instantiation". http://vcapp.csee.usf.edu/~katkoori/kkweb/audi.html
[113] R. J. Cloutier and D. E. Thomas. "The combination of scheduling, allocation, and mapping in a single algorithm". In Proceedings 27th ACM/IEEE Design Automation Conference, pages 71-76, 1990.
[114] S. Katkoori.
"Behavioral profiling based high level power estimation methodologies for VLSI ASIC and FPGA synthesis". PhD thesis, University of Cincinnati, 1996.
[115] University of California, Berkeley. "Lager IV Release 4.0". 1991.
[116] B. S. Landman and R. L. Russo. "On pin versus block relationship for partitions of logic circuits". IEEE Transactions on Computers, 20:1469-1479, 1971.
[117] H. B. Bakoglu. "Circuits, Interconnections, and Packaging for VLSI". Addison-Wesley, 1990.
[118] H. Van Marck, D. Stroobandt, and J. Van Campenhout. "Toward an extension of Rent's rule for describing local variations in interconnection complexity". In 4th International Conference for Young Computer Scientists, pages 136-141, 1995.
[119] W. Li and D. K. Banerji. "Routability prediction for hierarchical FPGAs". In Great Lakes Symposium on VLSI, pages 256-259, 1999.
[120] S. Brown, J. Rose, and Z. G. Vranesic. "A stochastic model to predict the routability of field programmable gate arrays". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1827-1838, December 1993.
[121] K. M. Buyuksahin and F. N. Najm. "High-level power estimation with interconnect effects". In IEEE International Symposium on Low Power Electronics and Design, pages 197-202, 2000.

ABOUT THE AUTHOR

Hao Li was born in Beijing, China in 1972. He received the Bachelor of Engineering degree in Telecommunications Engineering in 1995 from Beijing University of Posts and Telecommunications, Beijing, China. He received the Master of Science degree in Electrical Engineering from Beijing University of Posts and Telecommunications, Beijing, China in 1999. He then joined the Ph.D. program in the Department of Computer Science and Engineering at the University of South Florida, USA. His research interests include VLSI design, physical design for FPGAs, electronic design automation, and high-level synthesis for FPGAs.
He has published a number of research papers in areas of VLSI Design and CAD. His paper was nominated for the best paper award at the Asian and South Pacific Design Automation Conference in 2004. He is a member of IEEE and ACM SIGDA.