Allocating Power Ground Vias in 3D ICs for Simultaneous Power and Thermal Integrity

HAO YU
Berkeley Design Automation

and

JOANNA HO and LEI HE
University of California, Los Angeles

The existing work on via allocation in 3D ICs ignores power/ground vias’ ability to simultaneously reduce voltage bounce and remove heat. This paper develops the first in-depth study on the allocation of power/ground vias in 3D ICs with simultaneous consideration of power and thermal integrity. By identifying principal ports and parameters, effective electrical and thermal macromodels are employed to provide dynamic power and thermal integrity as well as sensitivity with respect to via density. With the use of sensitivity, an efficient via allocation simultaneously driven by power and thermal integrity is developed. Experiments show that compared to sequential power and thermal optimization using static integrity, sequential optimization using the dynamic integrity reduces non-signal vias by up to 18%, and simultaneous optimization using dynamic integrity further reduces non-signal vias by up to 45.5%.

Categories and Subject Descriptors: B.7.2 [Hardware]: Integrated circuits – Design aids

Additional Key Words and Phrases: Thermal and Power Integrity, Macromodeling, Parametric 3D-IC Design

1. INTRODUCTION

In today’s two dimensional (2D) Systems-on-Chip (SoC) integration, the memory arrays are surrounded by logic circuits to address bits and to perform logic functions. Its performance thereby is limited by the long interconnect length. Thanks to advances in recent process technology, the introduction of the three dimensional (3D) integration [Banerjee et al. 2001; Goplen and Sapatnekar 2003; Das 2004; Cong et al. 2004; Davis and et al. 2005; Knickerbocker and et. al. 2005; Lim 2005; Goplen and Sapatnekar 2005; Cong and Zhang. 2005; Sapatnekar 2006; Li et al. 2006; Yu et al. 2006; Yu et al. 2006] enables flash, DRAM and SRAM to be placed atop logic devices and even microprocessor cores in SoCs. Compared
to the 2D integration, such an upwards arrangement of interconnect in 3D results in not only a shorter communication length but also a denser device density for tomorrow’s SoCs. Apparently, the physical (or geometrical) arrangement of layout resources in 3D would be similar to the physical design in 2D. However, as there are large numbers of devices densely packed in a number of device layers, it brings a significant burden to remove the heat and deliver the power (supply voltage) in 3D ICs.

Fig. 1 illustrates a typical 3D stacking of multiple device layers within one package. The supply voltage is delivered from the bottom power/ground planes in the package, passed by the through vias and C4 bumps and connected to the on-chip power/ground grid on active device layers. We call the through vias that deliver the supply voltage as the power/ground vias. The 3D integration, by definition, has integrated more than one layer of the active device. They draw a much larger current from the package power/ground planes than the 2D ICs. This obliviously can result in the IR drop for the horizontal on-chip power/ground grid. The surge of the injecting current can further lead to a large simultaneous switching noise (SSN) for those I/O drivers at the chip package interface. Fig. 2 shows a detailed view of how to place signal and power/ground vias through package planes. Clearly, the region to place power vias or ground vias decides the path of the returned current and the loop area for those signal nets connecting I/Os. As such, they form various loop-inductances with different sizes that have significant couplings with each other. Moreover, when the region to place power/ground vias is determined, the number of placed vias in that location, i.e., density, decides the amount of coupled current. When there are more vias in one region, its current density is lower and its self-inductance is smaller. This leads to a smaller SSN at I/Os. We call the voltage bounce at I/Os power integrity in this paper. To satisfy the power-integrity constraints for I/Os, the allocation of power/ground vias have been studied in [Zhao et al. 1998; Hong and et. al. 2005; Ryu and et. al. 2005]. Compared to the allocation of off-chip decoupling capacitors (mm$^2$), placing vias (um$^2$) have a smaller cost of area. This is important for high-density integration in 3D ICs as there may be a large number of signal nets to be delivered from the package to the chip. Therefore, a thorough study of power/ground vias is needed for 3D ICs.

Due to the increased power density and the slow heat-convection at inter-layer dielectrics, the heat dissipation is another concern in 3D ICs [Banerjee et al. 2001]. The excessively high temperature can significantly degrade the reliability and performance of interconnects and devices [Teng et al. 1997; Chiang et al. 2001; Wang and Chen 2003; Li et al. 2004; Zhan and Sapatnekar 2007]. We call the temperature gradient at active device layers thermal integrity in this paper. As shown in Fig. 1, a heat-sink is placed on the top of device layers and it is the primary heat-removal path to the ambient air. Note that through vias deliver supply voltages or signals from the bottom package through each active device layer. Since metal vias are good thermal conductors, the through vias can provide additional heat-removal paths passing the inter-layer dielectrics to the top heat-sink. This leads to the concept of adding dummy thermal vias or thermal vias directly inside chips [Chiang et al. 2001] to reduce effective thermal resistances. Its physical arrangement has been studied further [Goplen and Sapatnekar 2003; Cong et al. 2004; Goplen and
The dummy vias namely need to occupy the routing resources for signal nets and power/ground nets. Note that the use of power/ground vias is a must to deliver supply voltage for those active device layers from the package. They also provide good thermal conduction paths when extended to connect the heat-sink. This motivates us to reuse the power/ground vias for the heat removal in 3D ICs. The reusing of power/ground vias as thermal vias would avoid an over-design that separately allocates power/ground vias and thermal vias. Though there could be other known design freedoms such as decaps, a need is required to study the optimization of power and of thermal integrity simultaneously because power/ground vias happen to be useful to optimize both power and thermal integrity in 3D ICs.

As such, an allocation of power/ground vias simultaneously driven by power and thermal integrity can be defined for 3D ICs. Similar to [Goplen and Sapatnekar 2005], the allocation of power/ground vias in this paper is assumed after the placement and global routing and before the detailed routing for signal nets. The active device layers and power/ground layers are uniformly discretized into a number of tiles. The input ports are defined for those tiles where to inject input signals. The output ports are defined for those tiles where to probe the integrity constraint. The output ports are similar to the pre-designate regions in [Goplen and Sapatnekar 2005], all aligned for adding and adjusting vias. The tracks are defined accordingly for the vertical space occupied by the vertical through vias going from the bottom package to the top heat-sink. The number of tracks for power/ground vias, i.e., the via density, is determined by both integrity constraints as well as the signal-net congestion in the same track.

However, simultaneously considering power and thermal integrity is obviously a challenging task. Firstly, in the modern VLSI designs, dynamic power management such as clock-gating and uncertainty from the workload can lead to time-varying power inputs. This results in a spatially and temporally variant thermal model. The temperature gradient can have either a sharp-transition with a large peak value, or a time-accumulated impact to the device reliability. In addition, different regions can reach their worst-case temperature at different times. A dynamic thermal-integrity constraint thereby is needed to accurately guide the physical level resource allocation. A dynamic thermal model with the time-varying thermal power input is studied in [Teng et al. 1997; Chiang et al. 2001; Wang and Chen 2003; Li et al. 2004; Yu et al. 2006; Yu et al. 2006; Zhan and Sapatnekar 2007]. Compared to the use of the steady-state thermal model [Goplen and Sapatnekar 2003; Cong et al. 2004; Goplen and Sapatnekar 2005; Cong and Zhang. 2005; Li et al. 2006], the dynamic thermal model not only accurately monitors the temperature gradient but also avoids the over-design caused by the pessimistic static thermal-integrity constraint. Note that the dynamic power-integrity has already been employed in many on-chip or off-chip power integrity verifications and designs [Zhao et al. 2002; Zhao et al. 2002; Su et al. 2003; Li and et. al. 2005; Yu et al. 2006; Yu et al. 2007].

Next, an actuate electrical or thermal model considering all device layers and package planes can result in millions of unknowns with thousands of time-varying input sources. Moreover, during the physical design, the design parameters such
as the device size (or density) and location further increase the complexity. This makes the design for power and thermal integrity difficult to accomplish in a reasonable time and hence further blocks the physical level optimization. To reduce the simulation cost, the macromodel based approaches [Zhao et al. 2002; Su et al. 2003; Zheng et al. 2003; Yu et al. 2006; Yu et al. 2006; Yu et al. 2006; Yu et al. 2007] are used to find a compact representation of the electrical and thermal model. The works in [Yu et al. 2006; Yu et al. 2006] further introduce a structured and parameterized macromodel to provide both the nominal response and the sensitivity for the purpose of optimization.

The macromodeling based allocations in [Yu et al. 2006; Yu et al. 2006], however, are not effective when the number of input ports or output ports is large. Generally, there can be thousands of thermal-power sources injected at each active layers or hundreds of switching-current sources injected I/Os. The size of the macromodel increases with the number of ports, and hence the computational cost to solve the macromodel is still high. Moreover, the effectiveness to apply the macromodel also heavily depends on how to scope with the design parameters. Here the parameters are the via densities at tracks. Assuming that the locations of tracks are pre-designated [Goplen and Sapatnekar 2005], [Yu et al. 2006] adjusts the via density in each track. [Yu et al. 2006] further decomposes those tracks into multiple groups described by levels. The vias are then allocated uniformly for those tracks in the same level. However, it is still expensive to probe the integrity and adjust the via density at each track for a large number of pre-designated tracks.

In this paper, we introduce an allocation of power/ground vias to simultaneously consider the dynamic power and thermal integrity in 3D ICs. The primary contributions are two-fold. Firstly, we notice that the previous thermal via allo-
Fig. 2. The power delivery by vertical power/ground vias and its impact to inductive current loops.

cations [Cong et al. 2004; Goplen and Sapatnekar 2005; Cong and Zhang. 2005; Li et al. 2006; Yu et al. 2006] assume adding dummy vias to conduct heat. They ignore the fact that power/ground vias can help heat-removal as well. Therefore, the reusing of the power/ground via as thermal via can save the routing resource for signal nets. In this paper, we present a design methodology that reuses the power/ground vias as the heat-removal paths. The allocation of power/ground vias are oriented to minimize not only the dynamic power integrity, i.e., the voltage bounce for those I/Os at package and chip interface, but also the thermal integrity, i.e., the temperature gradient at those active device layers.

Secondly, we develop an effectively parameterized macromodel. The state variables such as voltages and temperatures, the input ports and output ports, and the parameters are all compressed by their dominant subspace. By taking ‘snapshots’ of input and output ports, we extract the correlations for input and output ports, respectively. The subspaces are then extracted from the correlations and are applied to compress input and output ports. Moreover, recall that those outputs are the places to probe integrity and add vias. We introduce a spectral clustering to further compress the parameters by the extracted subspace from the output port correlation. As a result, we build a system with a small number of principal ports, and parametrize it by a small number of principal parameters. By further employing a structure and parameterized macromodel, the voltages, temperatures and their sensitivities are further compressed. The macromodel is embedded in the sensitivity based allocation, where the via density is adjusted according to the sensitivities. Compared to the works in [Yu et al. 2006; Yu et al. 2006], the method developed in this paper improves the effectiveness of the macromodeling by identifying the principal ports and parameters.

The use of dummy vias in [Cong et al. 2004; Goplen and Sapatnekar 2005; Cong and Zhang. 2005; Li et al. 2006; Yu et al. 2006] separates the allocation of power/ground vias from the thermal vias. We call the separated allocation of thermal vias and power/ground vias as the sequential optimization, and call our allocation of power/ground vias for both power and thermal integrity as the simultaneous optimization. Moreover, the steady-state analysis is employed to calculate
a static integrity [Cong et al. 2004; Goplen and Sapatnekar 2005; Cong and Zhang.
2005; Li et al. 2006]. We use the sequential optimization with the static integrity
as the baseline, in comparison to the sequential optimization with the dynamic
integrity [Yu et al. 2006] and the simultaneous optimization with the dynamic in-
tegrity proposed in this paper. As shown by experiments, for the sequential integrity
optimization, the use of dynamic integrity reduces up to 18% allocated non-signal
vias compared to the use of static integrity. Furthermore, our simultaneous opti-
mization using dynamic integrity reduces up to 44.5% non-signal vias compared to
the baseline.

The rest of the paper is organized as follows. We review the background of
the dynamic electrical and thermal model and their macromodel by model order
reduction in Section 2. We formulate the allocation problem that simultaneously
considers the power and thermal integrity in Section 3. We introduce the principal
ports characterization in Section 4, and present the via-allocation algorithm using
macromodels in Section 5. We present experiment results in Section 6 and conclude
in Section 7.

2. BACKGROUND OF ELECTRO- THERMAL MODEL

2.1 Circuit Model for Dynamic Integrity

Since the analysis of thermal integrity is targeted during physical design, the gran-
ularity of discretized thermal model is thereby smaller than those for the micro-
architecture design [Liao et al. 2005]. As such, a distributed thermal-RC circuit
is employed for active device layers and dielectric layers. Each silicon or dielectric
layer is uniformly discretized by the finite difference method. As shown in Fig. 3
(a) and (b), each discretized tile can be represented by an RC-cell. The thermal
vias and global signal nets are modeled by the discretized RC as well. The following
boundary condition is added. The packaging (C4-bump package) and heat-sink are
modeled by a simple 1D convection resistor connected to the thermal ground, i.e.,
the ambient air with a fixed temperature (See Fig. 3 (d)). In addition, the inputs
are the time-varying thermal power [Tiwari et al. 1998; Liao et al. 2005] defined by
the running-average of the cycle-accurate (often in the range of ns) power over sev-
eral thermal time constants (often in the range of ms), and injected at input ports
of each layer. As such, a temporally and spatially variant temperature at output
ports can be considered by defining an integrity integral with respect to time and
space [Yu et al. 2006].

Since the active device layer at the bottom (See. Fig. 1) has the longest path to
the heat-sink on the top, in this paper, a dynamic thermal integrity is defined as the
integrated temperature fluctuation at \( p_o \) output ports on the bottom device layer.
As shown in [Teng et al. 1997; Chiang et al. 2001; Wang and Chen 2003; Li et al.
2004; Yu et al. 2006; Yu et al. 2006; Zhan and Sapatnekar 2007], a dynamic thermal
integrity can accurately capture not only the sharp-transition of temperature change
due to the dynamic power management, but also the time-accumulated temperature
impact that can affect the device reliability. Similar to the static thermal-integrity
analysis, the dynamic thermal integrity assumes the worst-case input from a limited
number of thermal-power inputs. However, since the dynamic integrity has a more
accurate transient temperature profile, it leads to a smaller allocation compared to
the static thermal-integrity based design [Goplen and Sapatnekar 2003; Cong et al. 2004; Goplen and Sapatnekar 2005; Cong and Zhang. 2005; Li et al. 2006].

Moreover, the dynamic power integrity [Zhao et al. 2002; Su et al. 2003; Li and et. al. 2005; Yu et al. 2006; Yu et al. 2007] in this paper is defined as the time-integrated voltage bounce at power/ground I/Os, which are located on the interface between the bottom device layer and the package. The discretized tile of a power/ground plane, power/ground via, and signal via in the package are modeled by an $RLC$-cell (See Fig. 3 (a) and (c)) under the PEEC model [Ruehli 1974]. The C4-bump, on-chip global signal nets and power grid are modeled as $RC$ circuits. The inputs are the switching-current for those input signals in the package. In addition, the power/ground vias in this paper are designed to provide power supply, alter voltage bounce, and remove heat from the package planes to the heat-sink. To avoid the difference in I/R drop, the vertical power/ground vias are aligned along the vertical dimension. As a result, the power/ground vias reused for heat-removal are all aligned from the bottom package to the top heat-sink.

The extracted thermal-$RC$ and electrical-$RLC$ models can be described in the state equation under modified nodal analysis (MNA). Since the distributed thermal-$RC$ circuit is a simpler case for the electrical-$RLC$ circuit, we use the presentation of the $RLC$ circuit unless stated otherwise. The MNA for the nominal electrical-$RLC$ circuit in time-domain without power/ground vias is

$$\begin{align*}
Gx(t) + C \frac{dx(t)}{dt} &= BI(t) \\
y(t) &= L^T x(t)
\end{align*}$$

Table I. Notation list for system equation (2). Note that $N = N_v + N_l$.

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$N$</td>
<td>Dimension of MNA state equation</td>
</tr>
<tr>
<td>$N_v$</td>
<td>Dimension of voltage variables in MNA</td>
</tr>
<tr>
<td>$N_l$</td>
<td>Dimension of current variables in MNA</td>
</tr>
<tr>
<td>$p_i$</td>
<td>Dimension of input-ports</td>
</tr>
<tr>
<td>$p_o$</td>
<td>Dimension of output-ports</td>
</tr>
<tr>
<td>$x(y)$ ($\in \mathbb{R}^{N \times 1}$)</td>
<td>State variable (at output)</td>
</tr>
<tr>
<td>$v_n$ ($\in \mathbb{R}^{N_v \times 1}$)</td>
<td>Nodal voltage variables</td>
</tr>
<tr>
<td>$i_l$ ($\in \mathbb{R}^{N_l \times 1}$)</td>
<td>Inductive-branch current variables</td>
</tr>
<tr>
<td>$G$ ($\in \mathbb{R}^{N_v \times N_v}$)</td>
<td>Nominal conductance matrix</td>
</tr>
<tr>
<td>$C$ ($\in \mathbb{R}^{N_v \times N_v}$)</td>
<td>Nominal capacitance matrix</td>
</tr>
<tr>
<td>$L$ ($\in \mathbb{R}^{N_l \times N_l}$)</td>
<td>Nominal inductance matrix</td>
</tr>
<tr>
<td>$E_i$ ($\in \mathbb{R}^{N_v \times p_i}$)</td>
<td>Input/output incident matrix</td>
</tr>
<tr>
<td>$E_l$ ($\in \mathbb{R}^{N_v \times N_l}$)</td>
<td>Inductive incident matrices</td>
</tr>
<tr>
<td>$X_i$ ($\in \mathbb{R}^{N \times N}$)</td>
<td>Parameterized topological matrix for</td>
</tr>
</tbody>
</table>

or in frequency ($s$) domain

$$[G + sC]x(s) = B I(s)$$

$$y(s) = L^T x(s)$$

(2)

where

$$x(s) = \begin{bmatrix} v_n \\ i_l \end{bmatrix}, B = \begin{bmatrix} E_i \\ 0 \end{bmatrix}.$$  

and

$$G = \begin{bmatrix} G & E_i \\ -E_i^T & 0 \end{bmatrix}, C = \begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix}.$$  

(3)

Note that $B$ is the topology matrix to describe $p_i$ input ports with injected input sources, and $L$ is the one to describe $p_o$ output ports for probing integrity and adjusting via density. All notations in (3) are summarized in Table 1.

Differences exist between thermal-RC and electrical-RLC circuits in MNA. For a thermal-RC circuit, (3) becomes

$$G = G, \quad C = C$$

(4)

where $G$ and $C$ have larger $RC$ values and result in a larger time-constant (in the scale of ms) than an electrical-RLC circuit does (in the scale of μs). Moreover, the input $I(s)$ for the thermal-RC circuit stands for the thermal-power. Recall that $I(s)$ stands for switching-current for electrical-RLC circuits. In addition, output $y$ at the selected nodes is temperature $T$ and voltage $V$ for the thermal-RC and electrical-RLC circuit, respectively.

2.2 Macromodeling by Model Order Reduction

As the layouts of active device layers and power/ground planes are discretized sufficiently, the size of resulting state-matrices thereby can be huge. As such, the distributed thermal-RC and electrical-RLC model are difficult to be employed for either the integrity verification or the optimization.

Model order reduction [E.J. Grimme 1997; Odabasioglu et al. 1998] finds the dominant state variables and obtains compact macromodels. As shown in [E.J. Grimme...
The dominant state variables are related to the block Krylov subspace

\[ \mathcal{K}(A, R) = \{ A, AR, \cdots A^{q-1}R, \cdots \}, \]

constructed from moment matrices

\[ A = (G + s_0C)^{-1}C, \quad R = (G + s_0C)^{-1}B \]

by expanding the system transfer function

\[ H(s) = L^T(G + sC)^{-1}B \]

at one frequency \( s_0 \).

By applying the block Arnoldi iteration [Odabasioglu et al. 1998], a small dimensioned projection matrix \( Q \) (\( N \times q \times p_i \)) can be found to contain \( q \)-th order block Krylov subspace

\[ \mathcal{K}(A, R, q) \subseteq Q. \]

Using \( Q \) the original system can be reduced by projection

\[ \hat{G} = QTGQ \quad \hat{C} = QTCDQ \quad \hat{B} = QTBDQ \quad \hat{L} = QTLD. \]

Accordingly, the reduced system transfer function becomes

\[ \hat{H}(s) = \hat{L}^T(\hat{G} + s\hat{C})^{-1}\hat{B}. \]

As proven in [E.J.Grimme 1997; Odabasioglu et al. 1998], the reduced \( \hat{H} \) approximates the original system transfer function \( H(s) \) by matching the first \( q \) block moments expanded at the frequency-point \( s_0 \). This procedure can be applied to generate compact macromodels for both thermal-RC and electrical-RLC circuits.

Our thermal-RC and electrical RLC models have multiple inputs and multiple outputs (MIMO). This brings challenges for the projection based model order reduction. The dimension of the reduced MIMO system \( \hat{H} \) (\( \in \mathbb{R}^{p_o \times p_i} \)) depends on the input-port number \( p_i \) and \( p_o \). In general, there are large numbers of injecting inputs. An accurate monitoring of integrity also needs large numbers of pre-designate regions to probe outputs. Therefore, an effective macromodel needs to further compress the number of ports when \( p_i \) and \( p_o \) are both large. In Section 4 we identify a much smaller number of principal input and output ports by studying the correlation.

3. PROBLEM FORMULATION

3.1 Levelized Via Tracks

Same as [Goplen and Sapatnekar 2005], this paper assumes that the via allocation is after the placement and global routing but before the detailed routing of the signal nets. The active device layers and power/ground planes are first discretized into titles according to the complexity of the circuit. The inputs of thermal power or switching currents are injected at tiles called input ports. The outputs of temperature gradient and voltage bounce are probed at tiles called output ports. Recall that there are \( p_o \) output ports for the electrical-RLC model probed at power/ground I/Os connecting the power supply between the package and the bottom device layer.
Moreover, as shown in Fig. 1, because the bottom device layer has the longest path to the heat-sink on the top, the \( p_o \) output ports for the thermal-\( RC \) model are probed at the bottom device layer in this paper.

The power ground vias are placed at centers of tiles between two layers, and follow an aligned path from the bottom package I/Os to the top heat-sink. We call those aligned paths vertical tracks or tracks. As vias are aligned, the \( p_o \) tracks pass both \( p_o \) output ports of the electrical-\( RLC \) model and \( p_o \) output ports of the thermal-\( RC \) model. The density of power/ground vias at each track is the primary design parameter considered in this paper. The density adjusts to satisfy two requirements at output ports. The first is the integrity constraint of the temperature gradient and voltage bounce. The second is the resource constraint with provided signal net congestion.

Devising a method to decide the number of output ports or tracks is a challenging task. On the one hand, if the number of probed output ports is too small, the design would miss those hotspots or violate the signal routing. On the other hand, it is impossible to explore the via density at each track when the number of probed output ports is large. To systematically allocate via, in this paper, we hierarchically decompose the solution space by levels:

**Definition 1.** Assuming each layer is discretized into \( N \) tiles, a level-\( i \) allocation is to hierarchically and symmetrically select \( 4^i \) tiles located in the center of one subdivision at one layer, and allocate vias aligned in one track between centers of tiles in adjacent device layers.

The level \( i \) is provided by users according to the complexity of the active circuit on one layout. The levelized via-allocation patterns are shown in Fig. 4. A level-0 pattern means allocating vias in the center tile, and a level-1 pattern means allocating vias in each of the 4 partitions. Note that the locations of tiles in one level is uniformly distributed, but the via density varies at different tiles. Our design freedom is the via-density (number of vias in one tile), which is adjusted according to the sensitivity discussed in Section 5.

An \( ith \)-level output-port matrix \( L \) in (2) is

\[
L = [l_1, l_2, \cdots, l_{p_o}], \quad p_o = 4^i.
\]

(5)

Each column \( l_j \) \((j = 1, \ldots, p_o)\) has a non-zero entry ‘1’ at \( jth \) output port. Obviously, for both the original model and the reduced model, directly probing either power or thermal integrity and further adjusting the via density at all \( 4^i \) output ports would be computationally expensive. In Section 4, we resolve this by compressing output ports.

### 3.2 Co-Optimization of Dynamic Thermal and Power Integrity

To capture the sharp-transition of temperature change as well as the time-accumulated temperature impact, we employ the thermal-integrity integral used in [Yu et al. 2006] as the measure of dynamic thermal integrity at \( jth \) \((j = 1, \ldots, p_o)\) output port:

\[
f_j^T = \int_{t_0}^{t_p} \max[y_j(t), Tc]dt = \int_{t_e}^{t_s} [y_j(t) - T_r]dt,
\]

(6)

with a pulse-width \((t_s, t_e)\) in a sufficient long time-period \(t_p\) all in the scale of thermal-constant (ms). \(y_j(t)\) is the transient temperature waveform at \(j\)th output port, and \(T_r\) is the reference temperature.

To further consider the spatial difference of \(p_o\) output ports at the bottom device layer, the overall thermal integrity is defined by a normalized summation:

\[
f_T = \frac{\sum_{j=1}^{p_o} f_{T_j}}{t_p \cdot p_o}
\]

(7)

Such a measure of thermal integrity takes into account both the temporal and spatial variation of temperature. Similarly, the power-integrity integral \(f_V\) is defined at \(j\)th power/ground I/O with reference voltages Vdd and ground, and integrated at the period \(t_p^V\) in the scale of electrical-constant (ns). The overall power integrity \(f_V\) is defined similarly to \(f_T^V\) for \(p_o\) power/ground I/Os with the reference voltage \(V_r\) (0 for ground vias and Vdd for power vias).

Accordingly, we have the following problem formulation:

**Formulation 1.** Given the targeted voltage bounce \(V_t\) for \(p_o\) output ports at power/ground I/Os, and the targeted temperature gradient \(T_t\) for \(p_o\) output ports at bottom device layer, the via-allocation problem is to minimize the total via number, such that the temperature gradient \(f_T^V\) is smaller than \(T_t\) and the voltage bounce \(f_V^T\) is smaller than \(V_t\).

Such a via-allocation problem simultaneously driven by power and thermal integrity can be represented by

\[
\min \sum_{j=1}^{p_o} n_j \\
\text{s.t. } f_V^T \leq V_t, \ f_T^V \leq T_t \\
\text{and } n_{\text{min}} \leq n_j \leq n_{\text{max}}
\]

(8)

Note that \(n_j\) is the via density at \(j\)th track and \(V_t\) and \(T_t\) are the targeted voltage bounce and temperature gradient. As discussed in Section 5.2, \(n_j\) is decided according to the power and thermal sensitivities obtained from the macromodel. As our power/ground vias are allocated after the placement and global routing of signal nets at each active device layer, the densities of those inter-layer signal nets are available to calculate a maximum density \(n_{\text{max}}\) for the power/ground vias. In addition, for the sake of the reliability concern for the large current, the via density \(n_j\) on the other hand can not be smaller than a minimum density \(n_{\text{min}}\). These parameters \((n_{\text{max}}, n_{\text{min}}, V_t, T_t)\) can be estimated and provided by users.

As presented in Section 5, the key to solving (8) is an efficient yet accurate dynamic analysis to evaluate \(V(t)\) and \(T(t)\) as well as their sensitivities with respect to \(p_o\) tracks, calculated at two different time scales, respectively. After obtaining sensitivities for \(p_o\) tracks, we decide the via density \(n_j\) in a greedy fashion for each level proportional to the sensitivity.

4. PRINCIPAL INPUT AND OUTPUT IDENTIFICATION

As discussed in Section 2, the effectiveness of macromodels diminishes when there are a large number of inputs and outputs. In this paper, we reduce the complexity of the problem by identifying the principal inputs and outputs.
from ports by studying the correlation of input signals and the correlation of output signals. The previous works [Feldmann and Liu 2004; Liu et al. 2005; Li and Shi 2006], however, extract the correlation by directly studying the topology matrix $B$ and $L$. As the correlation physically relates to the input and output signals, this paper extracts the correlation by taking samples of ‘snapshots’ at inputs and outputs. With the use of proper-orthogonal-decomposition (POD) or principal-component-analysis (PCA) [Moore 1981; Box et al. 1994; Antoulas et al. 2001; Rathinam and Petzold 2003; Astrid et al. 2007; Ding and Zha 2008], a large number of ports are compressed into a small number of principal ports.

4.1 Correlation Extraction

Since the electrical signals may share the same clock and operate within a similar logic function, their waveforms in time-domain at certain input ports can show a correlation. Similarly, the thermal power may differ significantly between those regions with and without the clock gating, but can be quite similar inside the region with the same mode as inputs have similar duty-cycles over time. We call this phenomenon input similarity.

As the input vector

$$I(t) = \begin{bmatrix} I_1(t_0) & \cdots & I_1(t_N) \\ \vdots & \ddots & \vdots \\ I_{p_i}(t_0) & \cdots & I_{p_i}(t_N) \end{bmatrix} \in \mathbb{R}^{p_i \times 1}. \tag{9}$$

is usually known during the physical design, they can be represented by taking a set of ‘snapshots’ sampled at $N$ time-points

$$\begin{bmatrix} I_1(t_0) & \cdots & I_1(t_N) \\ \vdots & \ddots & \vdots \\ I_{p_i}(t_0) & \cdots & I_{p_i}(t_N) \end{bmatrix} \tag{10}$$

in a sufficient long period $[0, T_p]$. The sampling cycle is in a different time-scale for the thermal-power (ms) and switching-current (ns).

According to the POD analysis [Antoulas et al. 2001; Rathinam and Petzold 2003; Astrid et al. 2007], the similarity can be mathematically described by a correlation matrix (or Grammian). As such, an input correlation matrix among $p_i$ input sources...
is estimated by a co-variance matrix in the affine form:

\[
\mathcal{R} = \frac{1}{N} \sum_{\alpha=1}^{N} (\mathbf{I}(t_\alpha) - \bar{\mathbf{I}})(\mathbf{I}(t_\alpha) - \bar{\mathbf{I}})^T \in \mathbb{R}^{p_i \times p_i}.
\]

(11)

The \(\bar{\mathbf{I}}\) is a vector of mean values defined by:

\[
\bar{\mathbf{I}} = \frac{1}{N} \sum_{\alpha=1}^{N} \mathbf{I}(t_\alpha)
\]

(12)

Usually, the input vector \(\mathbf{I}(t)\) is periodic and the waveform in each period can be approximated by the piecewise-linear model. (11) and (12) are then pre-stored in a table after the evaluation.

An output similarity is defined for responses at output ports and measured by a output correlation matrix. To extract the output correlation matrix that is independent on the inputs, we assume that \(p_i\) inputs in the input vector \(\mathbf{I}(s)\) are all the unit-impulse source \(h(s)\) and define an input-port vector \(\mathbf{J}(s)\) by

\[
\mathbf{J}(s) = \mathbf{B}\mathbf{I}(s), \quad \in \mathbb{R}^{1 \times N},
\]

(13)

which has \(p_i\) non-zero entries with the unit-value ‘1’. Accordingly, the \(p_o\) output responses \(y(s)\) are calculated by

\[
\mathbf{y}(s) = \mathbf{E}^T(\mathbf{G} + s\mathbf{C})^{-1}\mathbf{J}
= \begin{bmatrix}
    y_1(s) & y_2(s) & \cdots & y_{p_o}(s)
\end{bmatrix} \in \mathbb{R}^{p_o \times 1}.
\]

(14)

The according output correlation matrix is extracted in the frequency-domain. Similarly, the output signals can be represented by taking a set of ‘snapshots’ sampled at \(N\) frequency points

\[
\begin{bmatrix}
    y_1(s_0) & \cdots & y_1(s_N) \\
    \vdots & \ddots & \vdots \\
    y_{p_o}(s_0) & \cdots & y_{p_o}(s_N)
\end{bmatrix}
\]

(15)

in a sufficient wide band \([0, s_{max}]\). The \(s_{max}\) locates in a low-frequency range for the temperature and in a high-frequency range for the voltage.

A co-variance matrix is defined under the POD analysis [Antoulas et al. 2001; Rathinam and Petzold 2003; Astrid et al. 2007] in frequency-domain as follows

\[
\mathbf{R} = \sum_{\alpha=1}^{N} (\mathbf{y}(s_\alpha) - \bar{\mathbf{y}})(\mathbf{y}(s_\alpha) - \bar{\mathbf{y}})^T \in \mathbb{R}^{p_o \times p_o}
\]

(16)

to estimate the correlation matrix among \(p_o\) outputs. The \(\bar{\mathbf{y}}\) is a vector of mean values defined by:

\[
\bar{\mathbf{y}} = \frac{1}{N} \sum_{\alpha=1}^{N} \mathbf{y}(s_\alpha)
\]

(17)

Both (16) and (17) can be evaluated and pre-stored in a table as well.
4.2 Ports Compression

Let $V = [v_1, v_2, ..., v_K] (\in R^{N\times K})$ as the first $K$ singular-value vectors of the input correlation matrix $R$, and $W = [w_1, w_2, ..., w_K] (\in R^{N\times K})$ as the first $K$ singular-value vectors of the output correlation matrix $R$. All singular-value vectors are obtained from the singular-value decomposition (SVD) of $(V, W)$. The order $K$ is determined when the $K$th singular-value is below a desired threshold. A rank-$K$ matrix $P_i$ can be constructed by $P_i = VV^T$, and a rank-$K$ matrix $P_o$ can be constructed by $P_o = WW^T$. Note that both $V$ and $W$ are orthonormalized, i.e., $V = V^TV$ and $W = W^TW$. As shown in [Antoulas et al. 2001; Astrid et al. 2003; Astrid et al. 2007], the correlation matrix $(R, R)$ is essentially the solution that minimizes the least-square between the original states $(I(t), y(s))$ and their rank-$K$ approximations $(P_i \cdot I(t), P_o \cdot y(s))$.

As a result, both the input signals $I(t)$ and the output signals $y(s)$ can be approximated by an invariant (or dominant) subspace spanned by the orthonormalized columns of $V$ and $W$, respectively:

$$I = V I_K, \ y = W y_K.$$  \hspace{1cm} (18)

Note that in frequency domain, the output response $y(s)$ is obtained by

$$y(s) = L^T (G + sC)^{-1} B I(s).$$  \hspace{1cm} (19)

Based on (18) and (19), it leads to the following equivalent system equation

$$y_K(s) = L_K^T (G + sC)^{-1} B_K I_K(s),$$  \hspace{1cm} (20)

where

$$L_K^T = W^T L^T, \ B_K = BV.$$  \hspace{1cm} (21)

Therefore, both the dimensions of $L (\in R^{N\times p_o})$ and $B (\in R^{N\times p_e})$ are greatly reduced when $K << p_i$ and $p_o$. We call $I_K$ and $y_K$ principal inputs and outputs identified by principal input-port and output-port matrices $B_K$ and $L_K$, respectively.

Therefore, we obtain an equivalent system to the original system. Instead of the input/output identified by $I/y$, the new system is described by the principal input/output by $I_K/y_K$. Its transfer function is

$$H_K = L_K^T (G + sC)^{-1} B_K.$$  \hspace{1cm} (22)

The according Krylov subspace

$$\mathcal{K}(A, R_K) = \{A, A R_K, \cdots, A^{q-1} R_K, \cdots\}$$

is composed by moment matrices

$$A = (G + s_0 C)^{-1} C, \ R_K = (G + s_0 C)^{-1} B_K.$$  

Similarly, a projection matrix $Q_K$ that contains the Krylov subspace

$$\mathcal{K}(A, R_K, q) \subseteq Q_K,$$

can be obtained by the $q$th-order block-Arnoldi iteration [Odabasioglu et al. 1998]. Because of $K << p_i$, the computational cost to construct the projection matrix $Q_K$ is greatly reduced. Moreover, given the same size $n$ for reduced models, the number of matched block moments is increased from $[n/p_i]$ to $[n/K]$. Therefore,
the identification of principal input further helps build an effective projection matrix $Q_K$ to reduce the size of system matrices $\mathcal{G}$ and $\mathcal{C}$ by

$$\hat{\mathcal{G}} = Q_K^T \mathcal{G} Q_K \quad \hat{\mathcal{C}} = Q_K^T \mathcal{C} Q_K,$$

and further reduce the size of $\mathcal{L}_K$ and $\mathcal{B}_K$

$$\hat{\mathcal{B}}_K = Q_K^T \mathcal{B}_K \quad \hat{\mathcal{L}}_K^T = Q_K^T \mathcal{L}_K^T.$$

Recall that the $p_o$ outputs are not only the place we probe power and thermal integrity, but also the regions in which we plan to add and adjust the via density. The correlated output ports thereby can be clustered together and further parameterized with a common via density in the same cluster. This is illustrated with details in the following Section 5.

5. PARAMETERIZED MACROMODEL OF INTEGRITY AND SENSITIVITY

The changes at outputs, i.e., sensitivities can be similar for those correlated regions. As such, it can guide us to further group those correlated regions into a small number of ‘representative’ regions, called clusters, and further help effectively generate a parameterized macromodel with a cluster-sensitivity. Therefore, we can efficiently adjust the density of power/ground vias based on the cluster-sensitivity and optimize both the power and thermal integrity. In this Section, we employ a spectral analysis to cluster parameters, present a structured and parameterized macromodeling to provide both nominal values and their sensitivities, and discuss a via allocation algorithm while simultaneously considering power and thermal integrity.

5.1 Clustered Parameterization

The paper as far has only discussed the nominal circuit equation and state variables for active device layer, interlayer-dielectrics, power/ground planes, vertical vias for signal nets, and C4 bumps etc. In this part, we show how to consider the parameterized power/ground vias for both providing supply (return-path) and removing heat.

In Section 2, we have shown how to construct a topology matrix to probe power and thermal integrity at output ports. Output ports are defined at the bottom device layers from where the power/ground vias go up to the top heat-sink and go down to the bottom package I/Os. The power/ground vias are aligned to pass layer by layer. When adding and adjusting vias at output ports, the added vias change $RC$ values in the state matrix at different nodes. As a result, the parameterized description of vias needs to reflect the changes in the state matrices $\mathcal{G}$ and $\mathcal{C}$.

For an $ith$-level allocation with $p_o$ ($4^i$) vertical tracks, the locations to add vias are fixed and can be described by $p_o$ topological matrices:

$$X_1 = \begin{bmatrix} 1 & -1 & \cdots \end{bmatrix} \quad X_j = \begin{bmatrix} \cdots & \cdots & \cdots \end{bmatrix} \quad X_{p_o} = \begin{bmatrix} \cdots & \cdots & \cdots \end{bmatrix} \quad \text{all } \in \mathbb{R}^{N \times N}$$

(23)
Fig. 5. The topology matrices \( X_3 \) and \( X_7 \) when allocating vias between nodes 3 and 7.

Fig. 5 shows one simple example of the topological matrix \( X_3 \) and \( X_7 \) when allocating vias between tile-3 and tile-7 at two layers.

Then, the unit-conductance and capacitance matrices for \( j \)th track are:

\[
\begin{align*}
   n_j \cdot g_j &= n_j \cdot (g X_j), \\
   n_j \cdot c_j &= n_j \cdot (c X_j),
\end{align*}
\]

where \( g \) and \( c \) are conductance and capacitance for one via with unit area \( a \). Note that the via density \( n_j \) is related to the via area \( A_j \) in the tile by \( n_j = A_j / a \), where \( a \) is the unit area of via determined by the processing technology. Because conductance and capacitance values are both proportional to the area \( A \), they are implicitly proportional to the via density \( n_j \) as well.

Assuming the via density is \( n_j \) for \( j \)th track and defining a via density vector

\[
\mathbf{n} = [n_1, ..., n_po],
\]

a parameterized state equation can be obtained

\[
(G + sC + \sum_{j=1}^{po} n_j g_j + s \sum_{j=1}^{po} n_j c_j) x(n, s) = B_K I_K(s), \quad y_K(n, s) = \mathcal{L}^T_K x(n, s).
\]

(25)

Similar to [Li et al. 2005; Yu et al. 2006; Yu et al. 2006], we expand \( x(n, s) \) in Taylor series with respect to \( n_j \), and introduce a new state variable \( x_{ap} \)

\[
x_{ap} = [x^{(0)}, x^{(1)}_1, ..., x^{(1)}_{po}]^T.
\]

(26)

It contains both the nominal response \( x^{(0)} \) and its first-order sensitivities \([x^{(1)}_1, ..., x^{(1)}_{po}]\) with respect to \( po \) parameters \([n_1, ..., n_{po}]\). The overall responses is obtained by

\[
x = x^{(0)} + \sum_{j=1}^{po} x^{(1)}_j.
\]

Substituting (26) in (25), and explicitly matching expanded terms for each \( n_j \) up to the first-order, (25) can be reformulated into a parameterized system with augmented dimension

\[
(G_{ap} + sC_{ap}) x_{ap} = B_{ap} I_K(s), \quad y_{ap} = \mathcal{L}_{ap}^T x_{ap}.
\]

(27)
where the state matrices become
\[
G_{ap} = \begin{bmatrix}
G_0 & 0 & \cdots & 0 \\
n_1 g_1 & G_0 & & 0 \\
& & \ddots & \ddots \\
n_p o g_p & & & G_0
\end{bmatrix},
C_{ap} = \begin{bmatrix}
C_0 & 0 & \cdots & 0 \\
n_1 c_1 & C_0 & & 0 \\
& & \ddots & \ddots \\
n_p o c_p & & & C_0
\end{bmatrix},
\]
(28)
the input/output ports become
\[
B_{ap} = [B K, 0, \ldots, 0]^T, \quad L_{ap} = [L K, \delta n_1 L_K, \ldots, \delta n_p o L_K]^T,
\]
(29)
and the output responses become
\[
y_{ap} = [y^{(0)}_K, (y^{(1)}_K)_1, \ldots, (y^{(1)}_K)_p o]^T.
\]
(30)
Note that state matrices \( G_{ap} \) and \( C_{ap} \) constructed in this fashion both show a lower-block triangular structure. The nominal states are in the diagonal blocks and the perturbed states for different allocation-patterns are in the lower off-diagonal blocks. As such, there only factorization cost coming from diagonal blocks. In addition, the output response \( y_{ap} \) have two parts: the nominal value, \( y^{(0)}_K \), and its first-order sensitivity \( (y^{(1)}_K)_j \) with respect to each parameter \( n_j \) \((j = 1, \ldots, p_o)\). Obliviously, if we treat each \( n_j \) at \( j \)th track as an independent parameter, the system size will be become too large to analyze. Thanks for the correlation at output ports, we present a spectral clustering below to cluster those correlated tracks and assign same via density for tracks in one cluster and different via density for tracks in different clusters.

Algorithm 1: Spectral Clustering

1. Input: Cluster number \( K \), correlation matrix \( R \in R^{p_o \times p_o} \), and \( p_o \) topology matrices \( X_j \ j = 1, \ldots, p_o \)
2. Compute normalized Laplacian: \( R_L = D^{-1/2}RD^{1/2} \), where \( D = \text{diag}(R) \);
3. Compute the first \( K \) singular-value vectors \( v_1, \ldots, v_K \) of \( R_L \);
4. Let \( V = [v_1, \ldots, v_K] \in R^{N \times K} \) and \( R_C = R_L \cdot V \);
5. Add \( j \)th output port into \( k \)th cluster if \( R_C(j, k) \) is the maximum in \( j \)th row;
6. Form \( P_K \) by summing all clustered \( X_j \)s in \( k \)th cluster;
7. Output: new topology matrices \( P_k \) \((k = 1, \ldots, K)\)

Fig. 6. Algorithm 1 for spectral clustering of parameters.

The spectral analysis can efficiently cluster or partition a large-scale graph by analyzing the spectral values (eigen values or singular-values) of the graph [Alpert et al. 1999]. It is related to the POD or PCA analysis [Kannan et al. 2004; Ding and Zha 2008] if we build a correlation graph \( G(V, E) \) with \( p_o \) output ports as nodes, and their correlation as edges weighed by correlation coefficients. For output ports \( n_i \) and \( n_j \), an edge is added with the weight \( R(i, j) \). Fig. 7 shows the relation between the correlation graph and the correlation matrix.

The overall procedure of the spectral analysis is outlined in Algorithm 1. To improve the numerical stability, the spectral clustering first normalizes $R$ by its diagonal $D$ to construct the Laplacian $R_L$ of $R$. Similar to the concept of projection based model order reduction, the spectral clustering in this paper finds a proper projection matrix $V$ composed by the first $K$ singular-value vectors $[v_1, \ldots, v_K]$ of $R_L$. $V$ spans the $K$th-order subspace with the first $K$ dominant singular-value vectors or directions. By projecting the large-dimensional $R_L \in R^{p_o \times p_o}$ into the subspace spanned by $V$, $R_L$ is compressed to

$$R_C = R_L \cdot V. \quad (31)$$

Therefore, if each column vector $v_j$ is viewed as a cluster basis, data points in the correlation graph defined by $R_L$ is clustered into $K$ clusters along $K$ directions, with each direction defined by $v_j$. As such, an output port $j (j = 1, \ldots, p_o)$ is placed into a cluster $k (k = 1, \ldots, K)$ if $R_C(j, k)$ is the largest entry in the $j$th row of $R_C$. Accordingly, the $j$th track to place vias is added into $k$th cluster. Its related topology matrix $X_j$ is summed into a clustered topology matrix $P_k (k = 1, \ldots, K)$. Assuming $m$ tracks in one cluster with topology matrices $[X_{j1}, X_{j2}, \ldots, X_{jm}]$, then the new clustered topology matrix $P_k$ becomes

$$P_k = X_{j1} + X_{j2} + \ldots + X_{jm}. \quad (32)$$

In the same cluster $k$, all tracks have the same via density $n_k$. Hence only $K$ parameterized via densities need to adjust instead of $p_o$ via densities ($K << p_o$). We call those $K$ via densities principal parameters.

As a result, the parameterized system can be compressed into

$$(G_{ap} + s C_{ap}) x_{ap} = B_{ap} I_K (s), \quad y_{ap} = L_{ap}^T x_{ap}, \quad (33)$$

where the state matrices become

$$G_{ap} = \begin{bmatrix} \mathcal{G} & 0 & \ldots & 0 \\ n_1 g_1 & \mathcal{G} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ n_K g_K & 0 & \ldots & \mathcal{G} \end{bmatrix}, \quad C_{ap} = \begin{bmatrix} c & 0 & \ldots & 0 \\ n_1 c_1 & c & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ n_K c_K & 0 & \ldots & c \end{bmatrix}. \quad (34)$$

The $(B_{ap}, L_{ap}, x_{ap}, y_{ap})$ are in the same forms as $(B_{ap}, L_{ap}, x_{ap}, y_{ap})$ but with an augmented dimension compressed from $p_o$ to $K$. Note that because the dimension of nominal state matrices in the diagonal is still large, the model order reduction is needed to further reduce (33).

The main computational cost of Algorithm 1 is from the step 3, the singular-value-decomposition (SVD) of the correlation matrix. For a dense correlation matrix ($\in R^{p_o \times p_o}$) with the known-rank $K$, the complexity of SVD is $O(p_o K^2)$. Usually, the correlation matrix can be first sparsified by pruning small entries of each row. This would help reduce part of the computational cost. Moreover, as the correlation is done once and can be repeatedly used, the use of SVD is thereby an attractive approach to compress physical parameters for the design automation. In addition, when compared to the K-means clustering in [Liu et al. 2005], the spectral clustering developed in this paper determines the cluster by the subspace projection. It does not depend on an arbitrary initial condition that may lead to different clustering results caused by the K-means clustering.
Fig. 7. The parameter clustering by clustering nodes on the correlation graph built from the correlation matrix.

5.2 Macromodel based Via Allocation

A $q$th-order flat projection matrix $Q_{ap}$ can be obtained from block Arnoldi iteration to reduce (33). To preserve the separated nominal value and its sensitivity during projection, $Q_{ap}$ is first partitioned into $K + 1$ blocks

$$Q_{ap} = [Q_0, Q_1, ..., Q_K],$$ (35)

each $Q_j$ ($j = 0, 1, ..., K$) with size $N \times q \times K$. Then, similar to [Yu et al. 2006; Yu et al. 2006], $Q_{ap}$ is further structured as follows

$$Q_{ap} = \begin{bmatrix}
Q_0 \\
Q_1 \\
. . . \\
Q_K
\end{bmatrix}.$$ (36)

Each $Q_j$ ($j = 0, 1, ..., K$) is further orthonormalized with each other. As such, proven in [Yu et al. 2005; Yu et al. 2006; Yu et al. 2006] the structured reduction by $Q_{ap}$

$$\tilde{G}_{ap} = Q_{ap}^T G_{ap} Q_{ap}, \quad \tilde{C}_{ap} = Q_{ap}^T C_{ap} Q_{ap}, \quad \tilde{B}_{ap} = Q_{ap}^T B_{ap}, \quad \tilde{L}_{ap} = Q_{ap}^T L_{ap}$$ (37)

still preserves moments up to the $q$th-order.

After projected by $Q_{ap}$, the reduced macromodel in time-domain can be solved by the Backward-Euler (BE) integration with the state-equation below at time-instant $t$ and time-step $h$

$$(\tilde{G}_{ap} + \frac{1}{h} \tilde{C}_{ap})\tilde{x}_{ap}(t) = \frac{1}{h} \tilde{C}_{ap}\tilde{x}_{ap}(t - h) + \tilde{B}_{ap}I_K(t)$$

$$\tilde{y}_{ap}(t) = \tilde{L}_{ap}^T \tilde{x}_{ap}(t).$$ (38)

As shown in [Yu et al. 2005; Yu et al. 2006; Yu et al. 2006], the projection by $Q_{ap}$ preserves the block matrix structure. As a result, reduced $\tilde{G}_{ap}$ and $\tilde{C}_{ap}$ both have the same preserved lower-block triangular structure. Therefore, (38) can be efficiently solved by the block-backward-substitution.
Note that the reduced nominal response and its sensitivity are still separated at outputs:

\[ \tilde{y}_{ap} = [\tilde{y}_0^{(0)}, \tilde{y}_1^{(1)}]^T = [\tilde{y}_0^{(0)}, \tilde{y}_1^{(1)}, \ldots, \tilde{y}_K^{(1)}]^T. \]

The according overall output response \( \tilde{y} \), i.e., the integrity vector of voltage or temperature \((V/T)\) is

\[ \tilde{y}(n, t) = \tilde{y}^{(0)}(n, t) + \tilde{y}^{(1)}(n, t). \tag{39} \]

As discussed below, such a structured and parameterized macromodel can be incorporated into a sensitivity-based allocation.

### Algorithm 2: Sensitivity based Via Allocation

1. **Input**: \( K \) principal input ports, \( K \) principal output ports, \( K \) principal parameters, maximum temperature bound \( T_{\text{max}} \), maximum voltage-bounce bound \( V_{\text{max}} \), signal-net-congestion bound \( n_{\text{max}} \) and current-density bound \( n_{\text{min}} \).
2. Construct structured and parameterized macromodel by (37);
3. Compute nominal voltage \((V)\)/temperature \((T)\) and sensitivity \(S_V/S_T\) by (38);
4. Check \( V_{\text{max}} \) and \( T_{\text{max}} \) constraints for all tiles;
5. Increase the via density \( n \) according to weighted sensitivity \( S \) in the range of \((n_{\text{min}}, n_{\text{max}})\);
6. Update the structured and parameterized macromodel according to (34);
7. Repeat from Step 3 until Step 4 is satisfied;
8. **Output**: Via density vector \( n \).

Fig. 8. Algorithm 2 for the sensitivity-based via allocation with the use of macromodels.

Since our macromodel provides both nominal values and sensitivities, it can be incorporated in any gradient-based optimization. The overall optimization to solve the problem formulation (8) is outlined in Algorithm 2. Its inputs are two parts. The first is a principal system by (33) with the identified \( K \) principal input and output ports from Algorithm 1, and the clustered \( K \) principal parameters. The second is the user provided temperature bound \( T_{\text{max}} \), voltage bounce bound \( V_{\text{max}} \), signal-net congestion bound \( n_{\text{max}} \), and current-density bound \( n_{\text{min}} \). Then, a structured and parameterized macromodel is built once by (37). Both nominal responses and sensitivities at \( K \) principal input-ports for each perturbed allocation-pattern are solved by (38). If the integrity constraints are not satisfied for \( K \) principal tracks, the vias density vector \( n \) are increased according to the sensitivity. This process repeats until the integrity constraints are satisfied.

The sensitivity vector \( S \) in Step 3 is a weighted-maximum of normalized voltage-sensitivity vector \( S_V \) and thermal-sensitivity \( S_T \):

\[ S = \max \left[ \alpha \cdot \frac{\tilde{y}_1^{(1)}}{||\tilde{y}_1^{(1)}||}, \beta \cdot \frac{\tilde{y}_V^{(1)}}{||\tilde{y}_V^{(1)}||} \right] \in R^{p \times 1} \tag{40} \]
Table II. Electrical and thermal constants.

<table>
<thead>
<tr>
<th>layer</th>
<th>size</th>
<th>material</th>
</tr>
</thead>
<tbody>
<tr>
<td>heat-sink</td>
<td>2 cm × 2 cm × 1 mm</td>
<td>copper</td>
</tr>
<tr>
<td>device-layer</td>
<td>1 cm × 1 cm × 4 um</td>
<td>silicon</td>
</tr>
<tr>
<td>inter-layer</td>
<td>1 cm × 1 cm × 1 um</td>
<td>dielectric</td>
</tr>
<tr>
<td>P/G plane</td>
<td>2 cm × 2 cm × 10 um</td>
<td>copper</td>
</tr>
</tbody>
</table>

Table III. Dimensions of 3D ICs layers.

where \( \alpha \) and \( \beta \) are weights for \( \mathbf{S}_T \) and \( \mathbf{S}_V \), and the maximum of the normalized sensitivity is selected at each track.

The sensitivity vector is iteratively updated to calculate the new via density vector by

\[
\mathbf{n}^{(\text{iter}+1)} = \mathbf{n}^{(\text{iter})} + \gamma^{(\text{iter})} \cdot \mathbf{S}^{(\text{iter})},
\]

till the integrity constraints are satisfied at all principal output ports. Note that \( \gamma \) is an adaptive-controlled step size and geometrically decreases by a factor of 0.99 as the iteration proceeds. Recall that at each step, \( \mathbf{n} \) is constrained by the signal-congestion induced bound \( n_{\text{max}} \) and the current-density induced bound \( n_{\text{min}} \).

The overall computational cost of Algorithm 2 is low. This can be illustrated in two folds. Firstly, the use of the principal input/output identification and the principal parameter clustering leads to an efficient evaluation of integrity at principal ports. Secondly, due to the structure reduction, the nominal values and their sensitivities can be efficiently solved by the block-backward substitution of (38), and only the perturbed states are updated during each iteration according to (34).

A formal complexity analysis of Algorithm 2 is illustrated as follows. The main computational cost of Algorithm 2 is from steps 2 and 3, the model order reduction and update of the integrity and sensitivity. A \( q \)-th order model order reduction by Krylov-subspace projection needs one LU decomposition, \( q \) solves and the orthonormalization. Due to the lower-block triangular structure, the LU factorization is only applied for the nominal matrix in the diagonal. Moreover, there are only \( K \) off-diagonal blocks for \( K \) principal parameters. As such the total complexities of model order reduction is \( O(qN^{1.0} + q(KN)^{1.0} + Nq^2) \) \( (x, y \in [0, 9]) \). The computational cost of updating integrity and sensitivity is \( O(q^3 + Kq^2) \). The model order reduction is done only once and the update is done iteratively until the optimization converges.

6. EXPERIMENT RESULTS

6.1 Settings

The proposed two algorithms have been implemented in C and MATLAB. Experiments are run on a Sun-Fire-V250 workstation with 2G RAM. The via allocation based upon the steady-state analysis [Goplen and Sapatnekar 2005; Cong...
and Zhang. 2005] is used as the baseline. The reported number of vias is for power/ground vias, silicon, copper and dielectric are assumed for via, heat-sink, active device layer, inter-layer and PG plane, respectively. Table 2 and Table 3 summarize their electrical and thermal constants and dimensions. During the sensitivity-driven allocation, the maximum step $\gamma$ is $3e4$ with the damping factor 0.99, the weight $\alpha$ and $\beta$ are equal (0.5) for (40). We assume that the ambient air has a fixed temperature $40^\circ C$, and the heat-convection resistance is $0.8K/W$. The Vdd is 2.0V. In addition, to consider the impact of temperature to the electrical resistance, after each optimization, we update the resistance in the power delivery according to $R = R_0(1 + 0.0068 \times T)$. The targeted voltage violation $V_{\text{max}}$ is 0.2V and the targeted temperature $T_{\text{max}}$ is $52^\circ C$. Two modest 3D stackings are assumed for the experiments. One is with 2-device-layer/2-dielectric-layer and the other is with 4-device-layer/4-dielectric-layer. Moreover, there are 1-heat-sink and 2-P/G-plane. We increase the circuit complexity by increasing the complexity of circuits, the number of input sources, the number of discretized tiles, and the number (level) of tracks for adding vias. The correlation of input or output ports is extracted when the number of ports is large (> 50) and is sparsified with a pruning-threshold (1e-6).

As for the thermal model at each active device layer, we have the following assumptions as we have no real designed 3D ICs. Our paper assumes the use of MCNC (http://www.cbl.ncsu.edu/pub/Benchmark_dirs/LayoutSynth92) and IBM benchmark (http://er.cs.ucla.edu/benchmarks/ibm – place/) circuits with prior known information of placement and global routing for signal nets [Goplen and Sapatnekar 2005; Yu et al. 2006; Yu et al. 2006]. Each circuit is replicated and placed at each device layer, where the signal nets at two adjacent layers are assumed to aligned to connect with each other through vertical signal vias. The signal-net congestion $n_{\text{max}}$ is thereby calculated by estimating the number of signal nets in each tile at one layer. The $n_{\text{min}}$ for current density is assumed to be 10. Moreover, instead of using randomly generated thermal power [Goplen and Sapatnekar 2005; Yu et al. 2006; Yu et al. 2006], the thermal-power inputs are obtained from a pre-simulation of the transient power and temperature simulator [Liao et al. 2005].
within the floorplan of DEC-alpha micro-architecture. A sequence of six SPEC2000 applications (art, ammp, compress, equake, gcc, gzip) is applied as input and the thermal-power is recorded every 10-million clock cycles. The clock gating is assumed to have a period of 100ns and the power in the standby mode is assumed to be 20% of the running mode. The profile of thermal power is first uniformly divided inside each functional module on one layer, and then counted as inputs for placed circuits at that layer. An industrial package model is employed with the extracted RLCs and locations of input signals. The switching currents are modeled by the triangular waveform with falling time and rising time 5ns. Similarly, the sequences of input currents are generated by simulating the specified input vectors for an evaluation period of millions of clock cycles.

6.2 Principal Ports Identification

The first example is a 3D stacking with 3-level (64) tracks. There are 6K tiles with 100 thermal-power inputs and 200 switching-current inputs. We first show the result of input compression. Fig. 9 shows the difference in magnitude for the singular-values of the correlation matrix for input ports. Clearly, the singular values of the correlation matrix decrease quickly for high orders. The number of principal input-ports can be determined according to the singular values. In this paper, the threshold value is set as 0.1 of the maximum singular value, and its according order is set as the number of principal input-ports. In Fig. 9, the K=8 is selected as such. Similarly, the same procedure is applied to identity principal inputs for the thermal-power. Fig. 10 compares the time-domain waveforms between our macromodels with port-reduction and the exact MNA solution at selected ports for the previous example in one period of thermal model and electrical model, respectively. The input-port matrix of the electrical model is approximated with K=16 and K=8, respectively. Then, a 6th-order projection is applied to generate the electrical macromodel. Similarly, the input-port matrix of the thermal model...

is approximated with K=10 and K=5, and an 3rd-order projection is applied to generate the thermal macromodel. Clearly, both macromodels are visually identical to the original ones.

Next, we show the result of output compression and clustering. For the electrical-RLC network, Fig. 11 (b) shows the clustering result by the spectral clustering, and Fig. 11 (a) shows the waveforms of 5 clustered ports at cluster-3. The singular-value-decomposition (SVD) discovers that the singular values degrade significantly after 4th order. Therefore, the first four singular-value vectors are used to compose the subspace for clustering. The Spectral analysis leads to 4 clusters with 5, 15, 24, 20 clustered ports, respectively. For the 5 clustered output ports at cluster-3, the waveforms are clearly similar with each other with a maximum peak difference by 0.01V. Based on the result of clustered outputs, i.e., tracks, the vias are allocated with uniform density in one cluster and non-uniform density at different clusters.

6.3 Macromodel based Optimization

For the same example above, Fig. 12 presents the successive decrease of the transient temperature and voltage violation during the simultaneous optimization at one of principal output ports. The via number of this example is minimized in 6 iterations with the targeted voltage bounce 0.2V and the targeted temperature 52°C. Clearly, our sensitivity based allocation can effectively and efficiently minimize the via numbers under the dynamic power and thermal constraints.

Fig. 13 further shows the steady-state temperature map across the bottom device layer. In this example, we assume that all thermal-power sources are located along one side of the device layer. The initial chip temperature at the bottom layer is 150°C, and its temperature profile at steady-state is shown in Fig. 13 (a). In contrast, the via-allocation results in a cooler temperature that closely approaches the targeted temperature as shown in Fig. 13 (b). Clearly, even at steady-state the temperature is still spatially variant. An accurate measure of integrity is therefore needed to consider a time and space averaged integrity at selected probing ports.
Fig. 12. Iterative optimizations showing the reduction of (a) temperature and (b) voltage violation by via-allocation.

Fig. 13. Steady-state temperature map of bottom device layer (a) before allocating via, and (b) after allocating via in a different temperature scales.

6.4 Comparison between Steady-state and Transient Thermal Analysis

In the following, we present the scalability study of via allocations. We increase the circuit complexity by the number of cells, input ports and output ports. Input and output compression are both applied when the number of ports is large (> 50). Table 4 summarizes the complexities of the circuits, the reduced sizes and the principal ports.

Fig. 14. Visualized via-distribution. (a) is the result under only the power-integrity constraint, (b) is the result under only the thermal-integrity constraint, and (c) is the result under both of power and thermal integrity constraints simultaneously.

We compare the runtime and the number of vias in Table 5. The runtime includes the integrity and sensitivity analysis time and optimization time. As all optimizations usually converge in about 10 iterations, the optimization time is small and hence the analysis time dominates runtime. The baseline is the via allocation with the sequential optimization of thermal/power integrity using the static integrity [Goplen and Sapatnekar 2005]. In Table 5, column 2-3 show the runtime and allocated via number for the baseline, and column 4-8 show the results for the optimizations using the dynamic integrity. In detail, column 4 shows the runtime of transient analysis using macromodels without the port-compression, and column 5 shows the number of allocated vias under the sequential optimization. Column 6 shows the runtime of transient analysis using macromodels with the port-compression, and column 7-8 shows the number of allocated vias under the sequential and simultaneous optimizations, respectively.

We first compare the baseline with the design utilizing the dynamic analysis. Both methods allocate vias in a sequential fashion. I.e., we first allocate dummy thermal vias to satisfy thermal integrity with a targeted temperature gradient $52^\circ C$, and then allocate power/ground vias to satisfy power integrity with a targeted voltage bounce $0.2V$. As shown by Table 5, the vias are over-designed when using steady-state analysis. Compared to the optimization by steady-state analysis, the optimization by transient thermal analysis reduces vias by 12% for the first example and 17% for the second one on average. This is because our transient thermal analysis can accurately generate the dynamic thermal integrity, but the steady-state analysis has to assume a constant thermal integrity for all tiles.

Furthermore, the use of macromodels reduces the computational cost to solve power and thermal integrity and their sensitivities. Compared to the macromodel without the port-compression, the macromodeling with the port-compression reduces the overall runtime up to 16X with similar allocation results. Compared to the steady-state analysis with the full-matrix analysis, our macromodel with the
Table IV. The complexity of the original circuits and the reduced circuits including: the size, number of input-ports, and number of output-ports.

port-compression has a 127X smaller runtime. And the steady-state analysis cannot complete the largest example in a reasonable runtime. The maximum transient-waveform difference introduced by the macromodel is about 7% when compared to the exact transient waveform.

6.5 Comparison between Sequential and Simultaneous Optimizations

We further compare the sequential thermal/power optimization with the simultaneous thermal/power optimization. Here both methods allocate vias with the use of dynamic integrity. For the 2-layer example and the 4-layer example in Table 5, our simultaneous optimization reduces the via-cost up to 34% and 44.5% when compared to the sequential optimization with static integrity, and up to 22% and 27.5% when compared to the sequential optimization with dynamic integrity. This demonstrates that the reusing of power/ground vias can reduce the via cost compared to allocate the dummy thermal vias separately from the power/ground vias.

Fig. 14 further visualizes the results of circuit-4 (27K, 2-layer) in Table 5. As we divide a plane into a 2-dimensional output port array (16 x 16), the x and y-axes are the number of ports along the horizontal and vertical dimensions. (a) shows the distribution of the vias with 8 clusters when only using power integrity constraint, (b) shows the distribution of the vias with 10 clusters when only using thermal integrity constraint, and (c) shows the distribution of the vias when simultaneously using thermal and power integrity constraints. Clearly, thermal power inputs and switching current inputs lead to a difference distribution of vias. Due to the use of clustering, the vias are allocated with uniform density in tracks at one cluster and non-uniform density for different clusters.

7. CONCLUSIONS

The existing work on via-allocation does not consider thermal and power integrity simultaneously in 3D ICs. The wise assignment of power/ground vias is effective to reduce the size of the loop current and hence reduce the voltage bounce induced by the inductive coupling. By further extending power/ground vias to the top heatsink, they can be reused to remove the heat. An allocation of power/ground vias is presented in this paper with simultaneous consideration of the dynamic power and thermal integrity. Such a problem formulation can significantly avoid an over-design...
Table V. Comparisons of via number and runtime for the sequential optimization with steady-state analysis, the sequential optimization with transient analysis and the simultaneous optimization with transient analysis. Two macromodels are used during the transient analysis. Macromodel-1 does not use the port-compression, and macromodel-2 uses the port-compression.

<table>
<thead>
<tr>
<th>ckt</th>
<th>Steady-state(direct)</th>
<th>Transient(MACRO-1)</th>
<th>Transient(MACRO-2)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>runtime (s)</td>
<td>total via # by seq-opt</td>
<td>runtime (s)</td>
</tr>
<tr>
<td>ckt1(2-layer)</td>
<td>5.4</td>
<td>178800</td>
<td>0.63</td>
</tr>
<tr>
<td>ckt2(2-layer)</td>
<td>29.7</td>
<td>184900</td>
<td>0.81</td>
</tr>
<tr>
<td>ckt3(2-layer)</td>
<td>182.2</td>
<td>218100</td>
<td>18.6</td>
</tr>
<tr>
<td>ckt4(2-layer)</td>
<td>1269.2</td>
<td>234800</td>
<td>165.7</td>
</tr>
<tr>
<td>ckt5(2-layer)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>ckt1(4-layer)</td>
<td>12.8</td>
<td>219200</td>
<td>1.9</td>
</tr>
<tr>
<td>ckt2(4-layer)</td>
<td>73.8</td>
<td>249000</td>
<td>4.1</td>
</tr>
<tr>
<td>ckt3(4-layer)</td>
<td>609.2</td>
<td>279300</td>
<td>93.7</td>
</tr>
<tr>
<td>ckt4(4-layer)</td>
<td>3308.7</td>
<td>302900</td>
<td>512.3</td>
</tr>
<tr>
<td>ckt5(4-layer)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

when allocating vias in 3D ICs.

To efficiently solve the problem with dynamic-integrity constraints, two techniques, the compression of ports (parameters) and the structured and parameterized macromodeling are developed to effectively and efficiently generate both of the integrity and its sensitivity. Compared to the baseline of the existing approach using the steady-state thermal analysis and the sequential optimization, experiments show that our simultaneous optimization of power and thermal integrity reduces non-signal vias up to 45.5%.

ACKNOWLEDGMENTS

This work is partially supported by NSF CCR-0401682, and a UC-MICRO grant sponsored by Intel and Mindspeed. In addition, the authors thank the reviewers for insightful comments to make the paper better.

REFERENCES


Sapatnekar, S. ICICDT-2006. Physical design automation challenges for 3d ics. In Int. Conf. on IC Design and Tech.


