FPGA Circuits, Architectures, and Physical Synthesis for Power Efficiency, Process Variation, and Reliability
Principal Investigator (PI)
Prof. Lei He
Participating Students
1. Lerong Cheng
2. Zhe Feng
3. Yu Hu
4. Fei Li
5. Yan Lin
6. Vikram M.N. Rao
7. Phoebe Wang
Funding Sources
CCR-0306682, "Design and Synthesis of Power Efficient Programmable Fabric"
UC MICRO 05, "Power Efficient and Variation Tolerant FPGA"
Related Projects
Synthesis and Verification for Heterogeneous FPGA and Defect Tolerance
Research Outcomes
Configurability at the circuit level, particularly in the field programmable gate array (FPGA), attracts a great deal of attention because of its short time to market and low non-recurring costs. However, FPGAs consume much more power than custom-designed circuits. This project aims to reduce FPGA power and increase parametric yield. Our contributions in this area include:
Power modeling and characteristics of FPGA architecture. We developed a mixed-level, cycle-accurate FPGA power simulator and studied the power characteristics of existing FPGA architectures. We show that interconnect power is dominant and that leakage power is of growing significance, so both should be targeted for power reduction. For existing FPGA architectures, the min-area architecture also has the minimal energy but the largest delay. Tuning the LUT (lookup table) and cluster sizes leads to a high-performance architecture with 0.7x the delay but 2.3x the energy of the min-energy architecture. The results were presented in [C32,C62,J20]. More recently, small gates such as AND2, XOR2 and MUX2 have been mixed with LUTs inside the programmable logic block (PLB) to reduce area and power and to increase performance in FPGAs. However, it has been unclear whether incorporating macro-gates with wide inputs inside PLBs is beneficial. We therefore evaluated a heterogeneous FPGA with mixed LUTs and macro-gates using our newly developed flow, and show that mixing LUTs and macro-gates, both with 6 inputs, improves performance by 16.5% and reduces logic area by 30% compared to using only 6-input LUTs. This work was presented in ICCAD'07 [C110].
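The delay and energy ratios quoted in this item can be combined into a relative energy-delay product (EDP). The sketch below only performs that arithmetic; the function name `relative_edp` is an illustrative assumption and is not part of the project's tools.

```python
def relative_edp(delay_ratio: float, energy_ratio: float) -> float:
    """Energy-delay product of a design relative to a baseline (baseline = 1.0)."""
    return delay_ratio * energy_ratio


# Ratios quoted in the text: high-performance vs. min-energy architecture.
high_perf_edp = relative_edp(delay_ratio=0.7, energy_ratio=2.3)
print(f"relative EDP: {high_perf_edp:.2f}")
```

Under this metric the high-performance point has roughly 1.61x the EDP of the min-energy architecture, so its delay gain comes at a net cost in energy-delay terms.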
Field programmable power supply for power reduction. We show that fixed dual-Vdd patterns similar to those used in ASICs may lead to extra placement constraints and excessively long interconnects, and are therefore not effective in reducing FPGA energy. To address this, we introduced the concept of field programmable Vdd (including Vdd-level selection and power gating) for power and timing co-optimization, and designed FPGA circuits and fabrics that provide field programmability of the power supply for both logic blocks and interconnects. Programmable supply voltage provides an extra dimension of field programmability that had not previously been studied for FPGAs, and it reduces the FPGA energy-delay product by 29% compared to single-Vdd FPGAs with the Vdd level suggested by ITRS. This work was presented in ISFPGA'04 [C44], DAC'04 [C52], ICCAD'04 [C56], and ASPDAC'05 [C59], considering logic blocks and interconnects, respectively. Some new results were also presented in [J28].
Architecture Evaluation with Vdd programmability. The Vdd-programmable FPGA circuits in [C52,C56] may increase the number of SRAM cells for configuration by over 100%. We designed novel circuits that obtain field programmable Vdd gating (or Vdd selection) with a negligible SRAM cell increase at the chip level, or field programmable Vdd selection and Vdd gating together with a 28% SRAM cell increase at the chip level. We conducted a comprehensive FPGA architecture evaluation and drew the following conclusions: (i) dual-Vdd selection for logic clusters plus power gating without dual-Vdd for interconnects is the best architecture choice considering timing, area, and power, and it reduces power by 2x at about 17% area overhead with virtually no SRAM cell increase; (ii) the power-optimal architecture has the same LUT size of 4 as the existing area-optimal architecture; and (iii) a LUT size of 7 leads to minimal delay regardless of whether Vdd-programmability is used. This work was presented in ISFPGA'05 [C62] and TVLSI [J18].
Device and Architecture Co-optimization. We conducted the first published co-optimization of device parameters (Vdd, Vt and sleep transistor size) and FPGA architecture. We developed a trace-based FPGA power and timing evaluation method that is orders of magnitude faster than cycle-accurate simulation [C32] and enables effective exploration of the huge solution space of architecture and device co-optimization. Compared to a baseline similar to the state-of-the-art industrial FPGA architecture followed by device tuning, our co-optimization reduces energy-delay product by 18.4% and chip area by 23% without power gating, or reduces energy-delay product by 58% with an 8.3% chip area increase due to the sleep transistors for power gating. This work was presented in DAC'05 [C70] and TCAD'07 [J32]. Furthermore, we incorporated process variations into device and architecture co-optimization and concluded that LUT size 4 gives the highest leakage yield and LUT size 7 gives the highest timing yield. Considering both leakage and timing limits, LUT size 5 achieves the maximum combined leakage and timing yield. This work was presented in ICCAD'05 [C77].
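To make the three yield notions in this item concrete, the following minimal sketch counts how many sampled chips pass a leakage limit, a timing limit, or both. The sample values, units, and the helper name `combined_yield` are hypothetical and chosen only for illustration; they are not data or code from [C77].

```python
from typing import Iterable, Tuple


def combined_yield(chips: Iterable[Tuple[float, float]],
                   leakage_limit: float,
                   delay_limit: float) -> float:
    """Fraction of sampled chips whose leakage AND delay both meet the limits."""
    chips = list(chips)
    if not chips:
        raise ValueError("need at least one sample")
    passing = sum(1 for leak, delay in chips
                  if leak <= leakage_limit and delay <= delay_limit)
    return passing / len(chips)


# Hypothetical (leakage, delay) samples in arbitrary normalized units.
samples = [(0.8, 1.0), (1.2, 0.9), (0.9, 1.3), (1.0, 1.0)]

no_limit = float("inf")
leakage_yield = combined_yield(samples, 1.0, no_limit)  # leakage limit only
timing_yield = combined_yield(samples, no_limit, 1.1)   # timing limit only
both_yield = combined_yield(samples, 1.0, 1.1)          # both limits at once
print(leakage_yield, timing_yield, both_yield)
```

In this toy data set the combined yield is lower than either single-limit yield, which is why the best LUT size for combined yield can differ from the best size for leakage yield or timing yield alone.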
Dual-Vdd interconnects with chip-level timing slack allocation. Our earlier work [C56] assumed that Vdd-level converters are used inside dual-Vdd interconnects. These converters add power overhead and diminish the power reduction. In a DAC'05 paper [C69], we proposed that dual-Vdd can be applied inside a routing tree with no Vdd-level converter if we only allow high-Vdd buffers to drive low-Vdd buffers. This new formulation reduces power significantly compared to all existing dual-Vdd FPGA interconnects. We also developed a chip-level time slack allocation algorithm using linear programming to maximize the power reduction from dual-Vdd buffers. The linear programming is enabled by a closed-form relationship between slack and the number of low-Vdd buffers that can be used by a routing tree. This work was presented in TCAD'06 [J25]. A linear programming (LP) based time slack allocation algorithm, EdTLC-LP, was later proposed for Vdd-programmable interconnects with mixed wire lengths and without Vdd-level converters [C87]. However, solving the LP problem for time slack allocation takes a long time. In ISLPED'06 [C91], we developed EdTLC-NW, a slack allocation algorithm based on min-cost network flow, to reduce runtime. However, deterministic Vdd assignment uses up timing slack exhaustively and significantly increases the number of near-critical paths, which degrades timing yield under process variation. In DATE'07 [C101], we presented two statistical Vdd assignment algorithms.
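The converter-free rule described in this item (a high-Vdd buffer may drive a low-Vdd buffer, but not the reverse) can be checked on a routing tree with a few lines of code. The sketch below is only an illustration under assumed data structures: the tree encoding, the "H"/"L" labels, and the function name are hypothetical and do not come from [C69] or [J25].

```python
from typing import Dict, List


def needs_no_level_converter(children: Dict[str, List[str]],
                             vdd: Dict[str, str]) -> bool:
    """Return True if no low-Vdd ("L") buffer drives a high-Vdd ("H") buffer."""
    for parent, kids in children.items():
        for kid in kids:
            if vdd[parent] == "L" and vdd[kid] == "H":
                return False  # this edge would require a Vdd-level converter
    return True


# Hypothetical routing tree: source "src" drives b1 and b2; b1 drives b3.
tree = {"src": ["b1", "b2"], "b1": ["b3"], "b2": [], "b3": []}

ok_assignment = {"src": "H", "b1": "H", "b2": "L", "b3": "L"}
bad_assignment = {"src": "H", "b1": "L", "b2": "L", "b3": "H"}

print(needs_no_level_converter(tree, ok_assignment))
print(needs_no_level_converter(tree, bad_assignment))
```

A consequence of the rule is that once a buffer on a path is set to low Vdd, every buffer downstream of it must also be low Vdd.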
Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction. Field programmable dual-Vdd interconnects are effective in reducing FPGA power. Assuming uniform-length interconnects, existing work has developed time slack budgeting to minimize power based on estimating the lower bound of the power reduction achievable with dual-Vdd for a given time slack. In this work, we show that such lower-bound estimation cannot be extended to the mixed-length interconnects used in modern FPGAs. We develop a technique to estimate the power reduction from dual-Vdd for mixed-length interconnects, and apply linear programming (LP) to solve slack budgeting to minimize power for mixed-length interconnects. Experiments show a 53% power reduction on average compared to single-Vdd interconnects. Furthermore, we present a simultaneous retiming and slack budgeting algorithm to reduce power in dual-Vdd FPGAs considering placement and flip-flop binding constraints. The algorithm is based on mixed integer linear programming (MILP) and achieves up to 20% power reduction compared to retiming followed by slack budgeting. We propose a runtime-efficient flow that applies simultaneous retiming and slack budgeting only when necessary. To the best of our knowledge, this is the first in-depth study of simultaneous retiming and slack budgeting for dual-Vdd programmable FPGA power reduction that considers layout constraints. This work was presented in DAC'06 [C87] and TODAES [J39].
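Slack budgeting starts from the timing slack available at each node of the timing graph. As background, the following minimal sketch computes that slack (required time minus arrival time) on a small acyclic graph. It is a textbook static timing calculation, not the LP or MILP formulation of [C87]; the graph, delays, and function name are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def node_slacks(edges: List[Tuple[str, str, float]],
                clock_period: float) -> Dict[str, float]:
    """Slack per node = latest required time - earliest safe arrival time."""
    succ = defaultdict(list)
    pred_count = defaultdict(int)
    nodes = set()
    for u, v, delay in edges:
        succ[u].append((v, delay))
        pred_count[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm gives a topological order of the timing graph.
    indeg = {n: pred_count[n] for n in nodes}
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for v, _ in succ[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("timing graph must be acyclic")

    # Forward pass: worst-case (longest-path) arrival time.
    arrival = {n: 0.0 for n in nodes}
    for n in order:
        for v, delay in succ[n]:
            arrival[v] = max(arrival[v], arrival[n] + delay)

    # Backward pass: latest time a signal may arrive without violating timing.
    required = {n: clock_period for n in nodes}
    for n in reversed(order):
        for v, delay in succ[n]:
            required[n] = min(required[n], required[v] - delay)

    return {n: required[n] - arrival[n] for n in nodes}


# Hypothetical graph: a->b->d is the long path, a->c->d the short one.
edges = [("a", "b", 2.0), ("b", "d", 3.0), ("a", "c", 1.0), ("c", "d", 1.0)]
slacks = node_slacks(edges, clock_period=6.0)
print(slacks)
```

In this example node `c` has more slack than the nodes on the long path, so a budgeting step would have more room to trade delay for power (for instance, low-Vdd buffers) on the branch through `c`.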
Placement and Timing for FPGAs Considering Variations. Process variation, which affects timing and power, is an important issue for modern integrated circuits in nanometer technologies. We presented the first in-depth study of applying statistical timing analysis with cross-chip and on-chip variations to speed-binning and guard-banding in FPGAs. This work was presented in FPL'06 [C88] and [J35]. Then, in FPL'06 [C89], we proposed optimizing FPGA performance via chip-wise placement considering process variations, which improves circuit performance by up to 19.3% for the tested variation maps compared to existing FPGA placement. See also [J37] and [C96].
Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping. Boolean matching is a key procedure in technology mapping for heterogeneous field programmable gate arrays (FPGAs), and SAT-based Boolean matching (SAT-BM) provides a highly flexible solution for various FPGA architectures. However, the computational complexity of state-of-the-art SAT-BM makes it impractical to apply. In this work, we propose an efficient SAT-BM algorithm that exploits functional and architectural symmetries. While the most recent prior work obtained up to 13x speedup, we achieve up to 200x speedup, both measured against the original SAT-BM algorithm. This work was presented in IWLS'07 [C102] and ICCAD'07 [C109].
Device and Architecture Concurrent Optimization for FPGA Transient Soft Error Rate. Late CMOS scaling reduces device reliability, and existing work has extensively studied the permanent SER (soft error rate) of configuration memory in FPGAs. In this work, we show that continued CMOS scaling dramatically increases the significance of FPGA chip-level transient soft errors in circuit elements other than configuration memory, so transient SER can no longer be ignored. We then develop an efficient yet accurate transient SER evaluation method, called the trace-based methodology, that considers logic, electrical and latch-window masking. By collecting traces of logic probability and sensitivity and reusing these traces for different device settings, we perform device and architecture concurrent optimization over hundreds of device and architecture combinations. Compared to the commonly used FPGA architecture and device settings, device and architecture concurrent optimization can reduce the transient SER by 2.8X and reduce the product of energy, delay and transient SER by 1.8X. This work was presented in ICCAD'07 [C111].
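The idea of reusing traces across device settings can be illustrated with a simplified first-order model. The sketch below assumes, purely for illustration, that a node's contribution to transient SER is a raw upset rate multiplied by the probabilities that the transient survives logic, electrical and latch-window masking. This multiplicative model, all numbers, and all names are assumptions for the example and are not the evaluation method of [C111].

```python
from typing import Dict


def node_ser(raw_rate: float, p_logic: float,
             p_electrical: float, p_latch: float) -> float:
    """Assumed model: upset rate scaled by the three masking survival probabilities."""
    return raw_rate * p_logic * p_electrical * p_latch


def chip_ser(logic_trace: Dict[str, float], setting: Dict[str, float]) -> float:
    """Sum node contributions, reusing the same logic trace for any device setting."""
    return sum(
        node_ser(setting["raw_rate"], p_logic,
                 setting["p_electrical"], setting["p_latch"])
        for p_logic in logic_trace.values()
    )


# Hypothetical trace: per-node probability that a glitch propagates logically.
# It depends on the circuit and inputs only, so it is collected once.
logic_trace = {"n1": 0.5, "n2": 0.25}

# Hypothetical device settings; only these device-dependent terms change.
device_settings = {
    "baseline": {"raw_rate": 8.0, "p_electrical": 0.5, "p_latch": 0.5},
    "tuned": {"raw_rate": 4.0, "p_electrical": 0.5, "p_latch": 0.5},
}

ser_by_setting = {name: chip_ser(logic_trace, s)
                  for name, s in device_settings.items()}
print(ser_by_setting)
```

Because the logic trace is independent of the device setting in this sketch, evaluating another setting costs only one pass over the stored trace rather than a new simulation.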
Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability. This work develops a trace-based framework to enable concurrent process and FPGA architecture co-development. The user can tune eight parameters for bulk CMOS processes and obtain the chip-level performance and power distributions and the soft error rate (SER), considering process variations and device aging. The framework is efficient because it is based on closed-form formulas. It is also flexible because process parameters can be customized for different FPGA elements, and no SPICE models or simulations are needed for these elements. The framework is therefore suitable for early-stage process and FPGA architecture co-development. The paper further presents a few examples of using the framework. We show that applying heterogeneous gate lengths to logic and interconnect may lead to a 1.3X delay difference and a 3.1X energy difference, and may reduce the standard deviation of leakage variation by 87%. This offers considerable room for power and delay tradeoff. We further show that device aging has a knee point over time, and that device burn-in to reach that point could reduce the performance change over 10 years from 8.5% to 5.5% and significantly reduce die-to-die leakage variation. In addition, we study the interaction between process variation, device aging and SER. We observe that device aging reduces the standard deviation of leakage by 65% over 10 years while having a relatively small impact on delay variation. Moreover, we find that neither device aging due to NBTI and HCI nor process variation has a significant impact on SER. This work was presented in FPGA'08 [C114].
References
[J18] Yan Lin, Fei Li and Lei He, "Circuits and Architectures for Field Programmable Gate Array with Configurable Supply Voltage", accepted by IEEE Transactions on Very Large Scale Integration Systems, 13 pages. (pdf).
[J20] Fei Li and Lei He, "Power Modeling and Characteristics of Field Programmable Gate Arrays", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13 pages, October 2005. (pdf).
[J25] Y. Lin and L. He, "Dual-Vdd Interconnect with Chip-level Time Slack Allocation for FPGA Power Reduction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 25, Issue 10, October 2006, pages: 2023 - 2034. (pdf)
[J28] Fei Li, Yan Lin, and Lei He, "Field Programmability of Supply Voltages for FPGA Power Reduction", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.26, No.4, April, 2007. (pdf)
[J32] Cheng, L., Li, F., Lin, Y., Wong, P. and He, L., "Device and Architecture Cooptimization for FPGA Power Reduction", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 26, Issue 7, July 2007, pages: 1211 - 1221. (link)
[J35] Yan Lin, Mike Hutton and Lei He, "Statistical Placement for FPGAs considering process variation," IET Computers & Digital Techniques, 2007.
[J37] Yan Lin, Lei He and Mike Hutton, "Stochastic Physical Synthesis Considering Pre-routing Interconnect Uncertainty and Process Variation for FPGAs" accepted by IEEE Transactions on Very Large Scale Integration Systems.
[J39] Yu Hu, Yan Lin, Lei He and Tim Tuan, "Physical Synthesis for FPGA Interconnect Power Reduction by Dual-Vdd Budgeting and Retiming" accepted by ACM Transactions on Design Automation of Electronic Systems (TODAES).
[C32] F. Li, D. Chen, L. He and J. Cong, "Architecture Evaluation for Power Efficient FPGAs", ACM International Symposium on Field Programmable Gate Array, 175-184, February 2003. (pdf)
[C44] F. Li, Y. Lin, L. He and J. Cong, "Low-power FPGA using Dual-Vdd/Dual-Vt Techniques", the Twelfth International Symposium on Field Programmable Gate Arrays, pages: 42-50, February 2004. (pdf)
[C52] F. Li, Y. Lin and L. He, "FPGA Power Reduction Using Configurable Dual-Vdd", IEEE/ACM Design Automation Conference, pp. 735-740, June 2004. (pdf)
[C56] F. Li, Y. Lin and L. He, "Vdd Programmability to Reduce FPGA Interconnect Power", IEEE/ACM International Conference on Computer-Aided Design, pp. 760-765, San Jose, Nov. 2004. (pdf)
[C59] Y. Lin, F. Li and L. He, "Routing Track Duplication with Fine-Grained Power-Gating for FPGA Interconnect Power Reduction", IEEE/ACM Asia and South Pacific Design Automation Conference, Shanghai, China, pp. 645-650, Jan. 2005. (pdf) (ppt)
[C62] Y. Lin, F. Li and L. He, "Power modeling and architecture evaluation for FPGA with novel circuits for Vdd programmability", the Thirteenth International Symposium on Field Programmable Gate Arrays, pp. 199-207, Feb. 2005. (pdf)
[C69] K. Tam and L. He, "Power-Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages", Design Automation Conference, Anaheim, CA, pp. 497-502, June 2005. (pdf) (ppt)
[C70] Y. Lin, and L. He, "Leakage efficient chip-level dual-vdd assignment with time slack allocation for FPGA power reduction", Design Automation Conference, pp. 720-725, June 2005. (pdf, ppt).
[C71] Cheng, L., Li, F., Lin, Y., Wong, P. and He, L, "Device and Architecture Co-optimization for FPGA Power Reduction" Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on Volume 26, Issue 7, July 2007 Page(s):1211 - 1221 (link) (pdf).
[C77] P. Wong, L. Cheng, Y. Lin and L. He, "FPGA Device and Architecture Evaluation Considering Process Variation," Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), San Jose, CA, pp. 19-24, Nov. 2005. (pdf)
[C87] Yu Hu, Yan Lin, Lei He and Tim Tuan, "Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction", in Proceedings of IEEE/ACM Design Automation Conference, San Francisco, CA, pp. 478-483, July 2006. (pdf)
[C88] Yan Lin, Mike Hutton and Lei He, "Placement and Timing for FPGAs Considering Variations", International Conference on Field Programmable Logic and Applications, August 2006. (pdf).
[C89] Lerong Cheng, Jinjun Xiong, Lei He, "FPGA Performance Optimization via Chipwise Placement Considering Process Variations", in International Conference on Field Programmable Logic and Applications, August 2006. (pdf).
[C91] Yan Lin, Yu Hu and Lei He, "An Efficient Chip Level Time Slack Allocation Algorithm for Dual-Vdd FPGA Power Reduction", International Symposium on Low Power Electronics and Design, October 2006. (pdf) (ppt).
[C96] Yan Lin and Lei He, "Stochastic Physical Synthesis for FPGAs with Pre-routing Interconnect Uncertainty and Process Variation", IEEE/ACM International Symposium on Field-Programmable Gate Arrays, Monterey, California, 80-88, Feb 2007 (pdf) (ppt)
[C101] Yan Lin and Lei He, "Statistical Dual-Vdd Assignment for FPGA Interconnect Power Reduction", IEEE/ACM Design Automation and Test in Europe, pp. 636-641, April 2007. (pdf)
[C102] Yu Hu, Victor Shih, Rupak Majumdar and Lei He, "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping", IWLS, 2007. (pdf) (ppt)
[C103] Yu Hu, Satyaki Das and Lei He, "Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates", IWLS, 2007. (pdf) (ppt)
[C109] Yu Hu, Victor Shih, Rupak Majumdar and Lei He, "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping", IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2007. (pdf) (ppt)
[C110] Yu Hu, Satyaki Das, Steve Trimberger and Lei He, "Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates", IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2007. (pdf) (ppt)
[C111] Yan Lin and Lei He, "Device and Architecture Concurrent Optimization for FPGA Transient Soft Error Rate", IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2007. (pdf) (ppt)
[C114] Lerong Cheng, Yan Lin, Lei He, and Yu Cao, "Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability", Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, 2008. To appear. (pdf)