System-in-Package and 3D Integration
Worst-case crosstalk noise in high-speed buses: In this project, we revealed that worst-case crosstalk noise (WCN) depends on the interaction between logic patterns and the timing alignment of logic signals. We showed that the routing direction has a significant impact under the RLC model. When there are no timing window constraints, we showed that the commonly used superposition algorithm underestimates WCN by 15% on average, and proposed a new SS+AS algorithm that has virtually the same complexity as the superposition algorithm but much better accuracy. On average, the SS+AS algorithm underestimates WCN by 3% compared with simulated annealing and the genetic algorithm. We also showed that applying the RC model to high-speed interconnects in the ITRS 0.10um technology can underestimate the noise by up to 80%. This work was presented in [C30, C31]. We further extended our algorithm to consider aggressor switching windows and the victim sampling window. The extended SS+AS algorithm approximates WCN well, with 3% underestimation on average. The results were presented in [C42, J18].
Efficient time-domain model for transmission line: Transmission line effects become increasingly significant for on-chip high-speed interconnects. Efficient and accurate transmission line models are required for analysis and synthesis of such interconnects. In this project, we proposed a piecewise-linear (PWL) model for the far-end response of a single transmission line considering ramp input and capacitive loading. Our model divides the time axis into a number of regions according to the time of flight and the input rise time, and then approximates the far-end response by a piecewise-linear waveform in each region. The PWL model is at least 1000x faster than SPICE simulation while remaining highly accurate. We further applied the PWL model to calculate the delay, rise time, and oscillation amplitude of the coplanar waveguide (CPW) structure, and achieved less than 10% average error compared with SPICE simulation. The model was presented in [C40, J16].
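The way the time of flight and the rise time partition the time axis can be illustrated with the classical lattice (bounce) diagram. The sketch below is not the PWL model of [C40, J16]: it assumes an idealized lossless line, a purely resistive driver, and an open far end with no load capacitance, in which case the far-end response to a ramp is exactly piecewise linear. All function names and parameter values are hypothetical.

```python
def ramp(t, rise_time, vdd):
    """Saturated ramp: 0 before t = 0, rising linearly to vdd at t = rise_time."""
    if t <= 0.0:
        return 0.0
    if t >= rise_time:
        return vdd
    return vdd * t / rise_time


def far_end_voltage(t, z0, rs, time_of_flight, rise_time, vdd=1.0, bounces=50):
    """Far-end voltage of an ideal lossless line (assumed, not from the source).

    z0: characteristic impedance, rs: driver resistance.
    The far end is treated as an open circuit (reflection coefficient +1).
    """
    launch = z0 / (z0 + rs)           # voltage divider at the driver
    gamma_s = (rs - z0) / (rs + z0)   # reflection coefficient at the source
    gamma_l = 1.0                     # open far end
    v = 0.0
    for k in range(bounces):
        arrival = (2 * k + 1) * time_of_flight   # k-th wavefront reaches the far end
        amplitude = launch * (1.0 + gamma_l) * (gamma_l * gamma_s) ** k
        v += amplitude * ramp(t - arrival, rise_time, vdd)
    return v


def breakpoints(time_of_flight, rise_time, bounces=3):
    """Times at which the slope of the far-end waveform can change."""
    points = set()
    for k in range(bounces):
        arrival = (2 * k + 1) * time_of_flight
        points.add(arrival)
        points.add(arrival + rise_time)
    return sorted(points)


if __name__ == "__main__":
    # Hypothetical values: 50-ohm line, 25-ohm driver, normalized times.
    print(breakpoints(1.0, 0.5))
    print(far_end_voltage(1.5, z0=50.0, rs=25.0, time_of_flight=1.0, rise_time=0.5))
```

Between consecutive breakpoints the waveform is a straight line, so a delay or overshoot estimate only needs the values at the breakpoints. A capacitive load, as considered in the actual model, rounds these segments and is not captured here.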
Modeling and synthesis of transmission line for multi-channel communication: To overcome the limitations of traditional interconnects, multi-channel interconnects that transmit signals via high-frequency carriers have recently been proposed and realized for intra-chip and inter-chip communication. To efficiently design such transmission-line-based interconnects, we derived a closed-form model for signal-to-noise ratio (SNR) considering multiple ports and branches, and proposed efficient figures of merit (FOMs) to minimize the signal distortion. Using the proposed models, we automatically synthesized coplanar waveguides for radio-frequency (RF) interconnects with capacitive couplers. We minimized the total interconnect area under the constraints of SNR and signal distortion. Compared with the published manual designs, the synthesized solutions reduce area by up to 80%. This work was presented in [C75, J24].
Signaling by Staggered and Twisted Bundle: Existing shield insertion for multiple signal nets may lead to non-uniformly distributed capacitive coupling length and inductive return path, which introduces large delay and delay variation due to crosstalk. This paper discusses the design and test of a twisted and staggered interconnect structure to reduce both inductive and capacitive crosstalk. A transmission line model is introduced and an automatic layout synthesis is presented. Moreover, the proposed design is fabricated in the IBM 0.13um process and tested by an on-chip time-domain sampling circuit. As shown by measured results, our proposed design reduces delay by 25% and reduces delay variation by 25x when compared to designs employing coplanar shields [J49].
Post-silicon tuning for digitally tuned analog circuits: Joint design-time and post-silicon optimization for analog circuits has been an open problem in the literature, given the complex nature of analog circuit modeling and optimization. We formulate the co-optimization problem for digitally tuned analog circuits as optimizing the parametric yield subject to power and area constraints. A general optimization framework combining the branch-and-bound algorithm and the gradient ascent method is proposed. We demonstrate our framework with two examples in high-speed serial links, the transmitter design and the phase-locked loop (PLL) design. Experimental results show that compared with a manual design approach commonly used by analog designers, joint design-time and post-silicon optimization can improve the yield by up to 47% for transmitter design and up to 56% for PLL design under the same area and power constraints. Some initial results have been published in [C126].
(ii) Beyond-die placement and routing
Constraint driven I/O planning and placement for chip-package co-design: System-on-chip and system-in-package result in an increased number of I/O cells and more complicated constraints from both chip and package designs. This renders the traditional manually tuned (or chip-centered) I/O design suboptimal in terms of both turnaround time and design quality. Considering chip and package co-design, we proposed a general flip-chip I/O placement model suitable for chip-package co-design, and defined a set of design constraints considering both chip and package requirements. We then formulated a constraint-driven I/O planning and placement problem, and solved it by a multi-step algorithm based on integer linear programming. Experimental results using real industry designs showed that the proposed algorithm can effectively find a large-scale I/O placement solution and satisfy all given design constraints in less than 10 minutes. This work was presented in [T5, C80].
Substrate Routing for System in Package: Off-chip substrate routing for system in package has an ultra-high routing density and often applies non-Manhattan routing. Although there may be multiple routing layers in substrate routing, planar routing is required most of the time because vertical detours are usually not allowed due to signal integrity requirements in high-speed I/O signaling, and also because a large number of vias are allocated for power routing. Compared with on-chip routers, the existing commercial tools have much lower routability and often leave a large number of unrouted nets for manual routing. Therefore, substrate routing is on the critical path for time to market. Applying dynamic pushing to alleviate the net ordering problem, and flexible via-staggering and diffusion to reduce congestion, we have developed an effective yet efficient substrate routing algorithm. In experiments on ten industrial designs with a total of 6415 nets, a best known method (BKM) in a current industrial tool leaves 480 nets unrouted, whereas our algorithm reduces the unrouted nets to 104, a 4.6x reduction in net count and, in practice, an even larger reduction in design time. This work was presented in [C117, J47]. Recently, we further reduced the unrouted nets by another 4x and the runtime by 94x, using a diffusion-driven congestion reduction method [C126].
(iii) Off-chip power integrity
Noise driven in-package decoupling capacitance insertion: The existing decoupling capacitance optimization approaches meet constraints on the input impedance of the package. However, using impedance as the constraint leads to large overdesign. We therefore develop a noise-driven optimization for decoupling capacitors in packages for power integrity. To handle worst-case noise in the power delivery system, our algorithm uses simulated annealing to minimize the total cost of decoupling capacitors under a worst-case noise constraint. The key enabler for efficient optimization is an incremental worst-case noise computation based on FFT over incremental impedance matrix evaluation. Compared with the existing impedance-based approaches, our algorithm reduces the decoupling capacitor cost by 3x and is also more than 10x faster even with explicit noise computation. This work was presented in [C84].
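The frequency-domain noise evaluation mentioned above can be sketched as follows: transform the current waveform, multiply each bin by the network impedance at that frequency, and transform back. This is only a single-port illustration under stated assumptions, not the incremental algorithm of [C84]. The lumped `pdn_impedance` model, its component values, and all function names are hypothetical, and a plain DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath


def dft(samples):
    """Direct O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def pdn_impedance(f, r=0.01, l=1e-10, c=1e-8):
    """Hypothetical lumped model: series R-L supply path in parallel with a decap C."""
    w = 2.0 * cmath.pi * f
    series = complex(r, w * l)
    return series / (1.0 + 1j * w * c * series)


def noise_waveform(current, dt, impedance):
    """Voltage noise v = IDFT(Z * DFT(i)) for one period of a periodic current."""
    n = len(current)
    shaped = []
    for k, i_k in enumerate(dft(current)):
        signed_k = k if k <= n // 2 else k - n   # upper bins are negative frequencies
        z = complex(impedance(abs(signed_k) / (n * dt)))
        if signed_k < 0:
            z = z.conjugate()                    # Z(-f) = conj(Z(f)) for a real network
        shaped.append(z * i_k)
    return [v.real for v in idft(shaped)]


def worst_case_noise(current, dt, impedance):
    """Peak absolute noise over the sampled period."""
    return max(abs(v) for v in noise_waveform(current, dt, impedance))


if __name__ == "__main__":
    # Hypothetical current pulse sampled every 1 ns.
    pulse = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    print(worst_case_noise(pulse, 1e-9, pdn_impedance))
```

An optimizer such as simulated annealing would call `worst_case_noise` for each candidate decap configuration, which is why a cheap incremental update of the impedance matters in the actual method.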
Off-chip Decoupling Capacitor Allocation for Chip Package Co-Design: Off-chip decoupling capacitor (decap) allocation is a demanding task during package and chip co-design. Existing approaches cannot handle large I/O counts and large numbers of legal decap positions. This work proposes a fast decoupling capacitor allocation method. By applying spectral clustering, a small number of principal I/Os can be found. Accordingly, the large power supply network is partitioned into several blocks, each with only one principal I/O. This enables localized macromodeling for each block by a triangular-structured reduction. In addition, to systematically consider a large legal position map in a manageable fashion, the map of legal positions is decomposed into multiple rings, which are further parameterized in each block. The decaps are then allocated according to the sensitivity obtained from the parameterized macromodel for each block. Compared with PRIMA-based macromodeling, experiments show that our method (TBS2) is 25x faster and has 3.04x smaller error. Moreover, our decap allocation reduces the optimization time by 97x, and reduces decap cost by up to 16% to meet the same power-integrity target. This work was presented in [C104].
Efficient Decoupling Capacitance Budgeting Considering Operation and Process Variations: This paper solves the variation-aware on-chip decoupling capacitance (decap) budgeting problem. Unlike previous work assuming the worst-case current load, we develop a novel stochastic current model, which efficiently and accurately captures operation variation such as temporal correlation between clock cycles and logic-induced correlation between ports. The model also considers current variation due to process variation with spatial correlation. We then propose an iterative alternative programming algorithm to solve the decap budgeting problem under the stochastic current model. Experiments using industrial examples show that, compared with the baseline model which assumes maximum currents at all ports and under the same decap area constraint, the model considering temporal correlation reduces the noise by up to 5x, and the model considering both temporal and logic-induced correlations reduces the noise by up to 17x. Compared with the model using deterministic process parameters, considering process variation (Leff variation in this paper) reduces the mean noise by up to 4x and the 3-sigma noise by up to 13x. While existing stochastic optimization has been used mainly for process variation, this paper is, to the best of our knowledge, the first in-depth study on stochastic optimization taking into account both operation and process variations for power network design. We show that considering operation variation is highly beneficial for power integrity optimization, and that it should be researched for optimizing signal and thermal integrity as well. This work was presented in [C108].
Stochastic Current Prediction Enabled Frequency Actuator for Runtime Resonance Noise Reduction: The power delivery network (PDN) is a distributed RLC network with its dominant resonance frequency in the low-to-middle frequency range. Although the working frequencies of high-performance chips are generally much higher than this resonance frequency, the chip's runtime loading frequency is not. When a chip executes a chunk of instructions repeatedly, the induced current load may have harmonic components close to this resonance frequency, causing excessive power integrity degradation. Existing PDN design solutions, however, mainly target high-frequency noise and are not effective at suppressing such resonance noise. In this work, we propose a novel approach to proactively suppress this type of noise. A method based on a high-dimensional generalized Markov process is developed to predict current load variation. Based on this prediction, a clock frequency actuator design is proposed to proactively select an optimal clock frequency to suppress the resonance. To the best of our knowledge, this is the first in-depth study on proactively reducing PDN resonance noise induced by runtime instruction execution. This work was presented in [C124].
(iv) Thermal and power integrity for 3D integration
Temperature and Supply Voltage Aware Performance and Power Modeling at Microarchitecture Level: Performance and power are two primary design issues for systems ranging from server computers to handhelds. Performance is affected by both temperature and supply voltage because of the temperature and voltage dependence of circuit delay. Furthermore, as semiconductor technology scales down, leakage power's exponential dependence on temperature and supply voltage becomes significant. Therefore, future design studies call for temperature and voltage aware performance and power modeling. This work studies microarchitecture-level temperature and voltage aware performance and power modeling. We present a leakage power model with temperature and voltage scaling, and show that leakage and total energy vary by 38% and 24%, respectively, between 65 degrees centigrade and 110 degrees centigrade. We study thermal runaway induced by the interdependence between temperature and leakage power, and demonstrate that without temperature-aware modeling, underestimation of leakage power may lead to the failure of thermal controls, and overestimation of leakage power may result in excessive performance penalties of up to 5.24%. All of these studies underscore the necessity of temperature-aware power modeling. Furthermore, we study optimal voltage scaling for best performance with dynamic power and thermal management under different packaging options. We show that dynamic power and thermal management allows designs to target the common-case thermal scenario among benchmarks and improves performance by 6.59% compared to designs targeted at the worst-case thermal scenario without dynamic power and thermal management. Additionally, we show that the optimal supply voltage for the best performance may not be the largest supply voltage allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly [J15].
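The temperature and leakage interdependence behind thermal runaway can be illustrated with a fixed-point iteration: temperature sets leakage, leakage adds power, and power raises temperature. The sketch below is not the leakage model of [J15]. It assumes a single lumped thermal resistance and a simple exponential leakage law, and every name and constant is hypothetical.

```python
import math


def leakage_power(temp_c, p_ref, t_ref_c, k):
    """Assumed exponential leakage law: p_ref watts at t_ref_c, growth rate k per degree."""
    return p_ref * math.exp(k * (temp_c - t_ref_c))


def settle_temperature(t_ambient_c, r_thermal, p_dynamic, p_ref,
                       t_ref_c=65.0, k=0.02, t_limit_c=150.0,
                       max_iters=200, tol=1e-6):
    """Iterate T = T_ambient + R_thermal * (P_dynamic + P_leak(T)).

    Returns (temperature, converged). converged is False when the iteration
    exceeds t_limit_c, which is treated here as thermal runaway.
    """
    temp = t_ambient_c
    for _ in range(max_iters):
        total_power = p_dynamic + leakage_power(temp, p_ref, t_ref_c, k)
        new_temp = t_ambient_c + r_thermal * total_power
        if new_temp > t_limit_c:
            return new_temp, False      # temperature keeps climbing: runaway
        if abs(new_temp - temp) < tol:
            return new_temp, True       # stable operating point found
        temp = new_temp
    return temp, False


if __name__ == "__main__":
    # Hypothetical packages: r_thermal in degrees centigrade per watt.
    print(settle_temperature(45.0, 0.5, 40.0, 10.0))   # better cooling
    print(settle_temperature(45.0, 2.0, 40.0, 10.0))   # poorer cooling
```

With these made-up numbers the low-resistance package settles to a stable temperature while the high-resistance one never does, which is the qualitative behavior a temperature-unaware power model cannot predict.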
Thermal Via Allocation for 3D ICs Considering Temporally and Spatially Variant Thermal Power: All existing methods for thermal-via allocation are based on a steady-state thermal analysis and may lead to an excessive number of thermal vias. This paper develops an accurate and efficient thermal-via allocation method considering temporally and spatially variant thermal power. The transient temperature is calculated using a macromodel obtained by a structured and parameterized model reduction, which generates temperature sensitivity with respect to thermal-via density. By defining a thermal-violation integral based on the transient temperature, a nonlinear optimization problem is formulated to allocate thermal vias and minimize the thermal-violation integral. This optimization problem is transformed into a sequence of subproblems by Lagrangian relaxation, and each subproblem is solved by quadratic programming using sensitivities from the macromodel. Experiments show that compared to the existing method using steady-state thermal analysis, our method is 126x faster to obtain the temperature profile, and reduces the number of thermal vias by 2.04x under the same temperature bound [C90].
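One plausible reading of a thermal-violation integral is the area of the transient temperature curve that lies above the temperature bound. The source does not give the exact definition used in [C90], so the form below is an assumption, and the function name and sample values are hypothetical.

```python
def thermal_violation_integral(temps, dt, t_bound):
    """Trapezoidal estimate of the integral of max(0, T(t) - t_bound) over time.

    temps: temperature samples at a uniform time step dt.
    """
    excess = [max(0.0, t - t_bound) for t in temps]
    return sum(0.5 * (a + b) * dt for a, b in zip(excess, excess[1:]))


if __name__ == "__main__":
    # Hypothetical transient: a short excursion above a 90-degree bound.
    trace = [80.0, 90.0, 100.0, 90.0, 80.0]
    print(thermal_violation_integral(trace, dt=1.0, t_bound=90.0))
```

Unlike a steady-state peak check, this measure is zero for a trace that never crosses the bound and grows with both the size and the duration of an excursion, so brief spikes are penalized less than sustained overheating.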
Simultaneous Power and Thermal Integrity Driven Via Stapling in 3D ICs: The existing work on via stapling in 3D integrated circuits optimizes power and thermal integrity separately and uses steady-state thermal analysis. This paper presents the first in-depth study on simultaneous power and thermal integrity driven via stapling in 3D design. The transient temperature and supply voltage violations are calculated by a structured and parameterized model reduction, which also generates parameterized temperature and voltage violation sensitivities with respect to the via pattern and density. Using parameterized sensitivities, an efficient yet effective greedy optimization is presented to optimize power and thermal integrity simultaneously. Experiments with two active device layers show that compared with sequential power and thermal optimization using steady-state thermal analysis, sequential optimization using transient thermal analysis reduces non-signal vias by 11.5% on average, and simultaneous optimization using transient thermal analysis reduces non-signal vias by 34% on average. The via reduction would be higher for 3D designs with more device layers [C93].
Temperature Aware Microprocessor Floorplanning Considering Application Dependent Power Load: This work studies microprocessor floorplanning considering thermal and throughput optimization. We first develop a stochastic heat diffusion model that takes into account the application-dependent power load for thermal analysis. Then, we design the floorplanning algorithm based on this model. Experimental results show that, compared with the deterministic heat diffusion model, our model obtains up to a 3.2 degrees centigrade reduction of the on-chip peak temperature, a 1.25% reduction of the area, and 1.125x better CPI (cycles per instruction) performance. Compared with temperature-aware floorplanning in the HOTSPOT tool set, which ignores interconnect pipelining, our algorithm is up to 27x faster, reduces the peak temperature by up to 3 degrees centigrade, and also reduces CPI significantly with a negligible area overhead [C112].