# Minimal Skew Clock Synthesis Considering Time Variant Temperature Gradient

Hao Yu, Yu Hu, Chun-Chen Liu and Lei He Electrical Engineering Department University of California, Los Angeles

### 1. INTRODUCTION

Clock synthesis, a crucial design step for high-performance VLSI circuits, has been extensively studied [1, 2, 3, 4, 5, 6]. Given a set of sinks (flip-flops), the clock synthesis is to find a topology and embedding with the minimized mismatch of arrival times, i.e. skew, and the wire length. Wire length minimization under zero/bounded skew constraint were developed in [1, 2, 3]. However, the process, supply voltage, and temperature (PVT) variation induced skew was not considered. This paper is focused on temperature variations.

The exponential advance of VLSI technology has resulted in a high and non-uniform power dissipation over the chip [7, 8, 9, 10], which leads to a temperature gradient, i.e., spatial temperature variation over a chip that can cause non-negligible delay differences in both interconnects and devices. As clock is globally routed over the chip, the temperature gradient can bring significant skew variations [4, 10]. Therefore, the traditional methods [1, 2, 3] without considering temperature gradient become invalid since the temperature can vary spatially. Considering the skew caused by the spatial variation of temperature, TACO [4] proposes a re-embedding of clock tree. This work assumes a time-invariant worst-case temperature map, and minimize the skew by a merging diamond. However, due to the increasing use of dynamic power managements, the (thermal) power is time-variant [11] or depends on the workload [12]. If the skew is optimized based on the time-invariant temperature map for one application, the resulting embedding is unlikely optimized and may lead to an excessive skew for other applications. The first contribution of our paper is to point out that to minimize the worstcase skew for microprocessors we need to consider the time-variant temperature gradient but not the static gradient as in [4].

In addition, it is expensive, if not impossible to find the worst-case skew by exhaustive search of time-variant temperature maps on the scale of thermal time-constant. On the other hand, different instructions or applications usually share some common resources (such as L2-cache), and therefore temperature are temporally correlated. The temperature correlation information enables us to cluster temperature maps and to calculate the worst case skew more efficiently based on the clustered temperature maps. The second contribution of this paper is to develop an efficient algorithm, PErturbation based Clock Optimization (PECO)<sup>1</sup> leveraging the clustering idea to find a clock embedding with the minimal worst case skew. The experimental results show that our PECO algorithm reduces the worst

case skew by up to 10X compared to DME [3] within 1% wirelength overhead. By combining PECO and cross-link insertion [13], the worst case skew is reduced by up to 20X compared to DME and up to 7X compared to the original cross-link insertion algorithm [13] considering time-invariant variations.

The rest of the paper is organized as follows. A stochastic temperature model is described in Section 2. The PErturbation based Clock Optimization (PECO) problem is formulated and solved in Section 3. The experimental results are given in Section 4. This paper is concluded in Section 5.

# 2. TEMPERATURE CORRELATION EXTRACTION

We first extract the temperature gradient and correlation by a micro-architecture level power and temperature simulation [12], where the alpha architecture is used as an example. The overall chip is discretized into a uniform grid with total N nodes. By applying six different applications (*ammp, art, compress, equake, gzip, gcc*) from SPEC2000 benchmark in a sequence (each with a time-period  $t_p$ ), the thermal-power is obtained by averaging the cycle-accurate (scale of ps) dynamic power in the thermal-constant scale (scale of ms). Using this time-variant thermal power as input, the transient-temperature  $T(t_i, n_j)$  over the chip is calculated at different time-instant  $t_i$  for each node  $n_j$  in the grid.

In the following, we propose an automatic correlation extraction for the temperature. The temperatures at N nodes are modeled by random processes. Each node is described by a temperature sequence sampled at N time-instants,

$$\begin{split} \mathcal{S}_{n_1} &= \{T(t_1, n_1), ..., T(t_p, n_1), T(t_{p+1}, n_1), ..., T(t_{\mathcal{N}}, n_1)\}\\ \mathcal{S}_{n_2} &= \{T(t_1, n_2), ..., T(t_p, n_2), T(t_{p+1}, n_2), ..., T(t_{\mathcal{N}}, n_2), ...\}\\ ...\\ \mathcal{S}_{n_N} &= \{T(t_1, n_N), ..., T(t_p, n_N), T(t_{p+1}, n_N), ..., T(t_{\mathcal{N}}, n_N)\} \end{split}$$

a temperature (spatial) correlation matrix is defined by

$$C(i,j) = \frac{cov(i,j)}{\sigma_i \cdot \sigma_j},\tag{1}$$

where

$$cov(i,j) = \sum_{k=1}^{N} T(t_k, n_i) T(t_k, n_j) - \sum_{k=1}^{N} T(t_k, n_i) \sum_{k=1}^{N} T(t_k, n_j)$$
(2)

is the co-variance matrix between nodes,

$$\sigma_i = \sqrt{\sum_{k=1}^{N} T(t_k, n_i)^2 / \mathcal{N} - (\hat{T}_i)^2}, \quad \sigma_j = \sqrt{\sum_{k=1}^{N} T(t_k, n_j)^2 / \mathcal{N} - (\hat{T}_j)^2}$$
(3)

<sup>\*</sup>This project was partially supported by SRC project 1116 and NSF CAREER award 0401682. The presenter of this paper is Yu Hu. Address comments to lhe@ee.ucla.edu.

<sup>&</sup>lt;sup>1</sup>It also represents Parameterized Engineering Change Orders, as re-embedding is a form of ECO (Engineering Change Order).

are the standard deviations for nodes  $n_i$  and  $n_j$ , and

$$\hat{T}_i = \sum_{k=1}^{\mathcal{N}} T(t_k, n_i) / \mathcal{N}, \quad \hat{T}_j = \sum_{k=1}^{\mathcal{N}} T(t_k, n_j) / \mathcal{N}.$$
(4)

are mean-temperature for nodes  $n_i$  and  $n_j$ . The correlation coefficients C(i, j) can be precomputed and stored into a table. The experimental results show that the correlation is strong as the average correlation is about 0.8 for the six SPEC2000 applications.

# 3. FORMULATION AND ALGORITHMS

#### **3.1** Formulation and Algorithm Overview

To robustly embed a clock tree with the minimal skew introduced by the time-variant temperature gradient, our starting point is an initial balanced skew clock tree obtained by DME method [2, 3] under a uniform temperature map. The PErturbation based Clock Optimization (PECO) problem is formulated as follows.

FORMULATION 1. Given the source Src, sinks  $s_1 \cdots s_n$ , initial topology of one clock tree (with embedding), and one stochastic temperature map for a sequence of applications, our perturbation based algorithm is to find a re-embedded clock tree to minimize the worse-case skew and wire length considering the temperature gradient and correlation.

Similar to TACO [4], we assume that the merging point (or the balanced skew point) M of the initial tree is locally adjusted level by level in a bottom-up fashion according to the abstract topology of the clock tree with the bottom one at level-0. In each level i (except for the bottom one), the perturbed merging point  $M'^i$  of node pair  $(s_1, s_2)$  is assumed to locate in a bounded region centered at the initial merging points  $M^i$  of this node pair. A re-embedding after perturbation is a pair of embedded paths  $P_1^i$  and  $P_2^i$  for each  $M'^i$ to the node pair  $(s_1, s_2)$ . To avoid the overhead of wire length, the perturbations of one merging point at level *i* are assumed to happen along four Manhattan directions (North, South, West, East) from  $M^i$ . The re-embedding path from a merging point to its child is the path with minimal sum of the standard variations in the temperature map grids. For each level, we enumerate all possible permutation combinations of the merging points and then calculate the sensitivity of the worst skew with respect to perturbation configurations in the current level. Finally, we select those perturbations that potentially minimizes the worst case skew and propagate them to the next level.

#### 3.2 Parameterized Perturbation

To accurately calculate the worst-case skew, we first build an electrical circuit model for the clock tree. Each segmented clock wire is represented by a distributed  $\pi$ -circuit with the capacitors in nodes and resistors in edges. The node capacitor includes both the wire and loading capacitance.

Given the initial tree T with  $N_l$  levels and N nodes, its system equation can be described by the Modified Nodal Analysis (MNA) in frequency-domain

$$(G_0 + sC_0)x(s) = Bu \tag{5}$$

$$y(s) = L^{T} x(s) \tag{6}$$

where  $G_0$  and  $C_0 (\in \mathbb{R}^{N \times N})$  are the conductive and capacitive matrices for an initial tree, x(s) is the state variable for voltage, and y(s) is the output voltage. Note that x includes n sinks and m Steiner points, which can be represented by

$$x = [x_{s_1}, \cdots, x_{s_n}, x_{Src}, x_{st_1}, \cdots, x_{st_m}]^T$$
(7)

where  $s_i, i = 1, \dots, n$  are sinks and  $st_i, i = 1, \dots, m$  are Steiner points.

Since the source node in the tree is the only input of the system, the input port matrix  $B \ (\in \mathbb{R}^{N \times 1})$  becomes

$$B = [\underbrace{0, \cdots, 0}_{n}, 1, \underbrace{0, \cdots, 0}_{m}]^{T}.$$
(8)

It selects the input with source u(s). In this paper, we assume an impulse input at the source node. In addition, because only the voltage responses at sinks are interested, the output port matrix L at level  $i \in \mathbb{R}^{N \times n}$  becomes

$$L^{i} = \begin{bmatrix} I_{n \times n} \\ 0_{(N-n) \times n} \end{bmatrix}$$
(9)

The according selected output y is

$$y = [y_{s_1}, \cdots, y_{s_n}].$$
 (10)

It is in fact a SIMO (single-input-multi-output) system.

When a clock wire experiences a temperature gradient, the unitlength resistance  $r_{unit}$  is [10]

$$r_{unit}(x, y, t) = \rho_0 \cdot [1 + \beta \cdot T(x, y, t)],$$
(11)

where  $\rho_0$  is the unit-length resistance at  $0^{\circ}C$ , and  $\beta$  is the temperature coefficient of resistance  $(1/^{\circ}C)$ . When the embedding path  $d(M'^i, s_k)$  is fixed, we calculate the new resistance by

$$R(M'^{i}, s_{k}) = \sum_{\forall e \in d(M'^{i}, s_{k})} E[r_{unit}(e)] \cdot len(e)$$
(12)

where  $E[r_{unit}(x,y)]$  is the mean value of resistance in edge  $e\left(M^{\prime i},s_k\right)$ 

The impact of temperature gradient is modeled by a perturbation that modifies the location of the merging point. As a result, there is a correspondent change in the conductance matrix G. Therefore, a perturbed MNA under all perturbation configurations becomes

$$[(G_0 + \delta G_1) + s(C_0 + \delta C_1)] \cdot (x + \delta x_1) = Bu$$
  

$$[(G_0 + \delta G_2) + s(C_0 + \delta C_2)] \cdot (x + \delta x_2) = Bu$$
  

$$...$$
  

$$[(G_0 + \delta G_I) + s(C_0 + \delta C_I)] \cdot (x + \delta x_I) = Bu.$$
 (13)

Similar to the approach in [14], by organizing the expanded terms in the order of perturbations, and defining a new state variable

$$x_P = [x, \delta x_1, \cdots, \delta x_I]^T \tag{14}$$

composed by nominal values and first-order perturbations, an augmented system equation could be obtained

$$(G_P + sC_P)x_P = B_P u, \quad y_P = L_P^T x_P, \tag{15}$$

where

$$G_P = \begin{bmatrix} G_0 & 0 & \dots & 0 & 0 & 0 \\ \delta G_1 & G_0 & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ \delta G_i & 0 & \dots & G_0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \delta G_I & 0 & 0 & \dots & 0 & G_0 \end{bmatrix}$$
(16)

$$B_P = [B, \overbrace{0, \cdots, 0}^{I-1}]^T \tag{17}$$

and

$$L_P = Diag(\overbrace{L_0, \cdots, L_0}^{I}).$$
(18)

Note that  $C_P$  has a similar structure as  $G_P$ . In addition, the output is

$$y_P = [y, \delta y_1, \cdots, \delta y_I]^T, \tag{19}$$

where each  $\delta y_i$  is a length-*n* vector, representing the voltage response and changes for each sink. As shown in Section 3.4, by solving the above system equation (15) in time-domain, the worst-case skew at each level can be calculated from the voltage response y(t).

#### 3.3 Speedup by Parameter Clustering

The above parameterized macro model can be huge due to the numerous perturbation combinations, which is exponential to the number of merging points. On the other hand, due to the existence of the temperature correlation at different regions, many merging points share the similar perturbation configuration. Therefore, an efficient worst-case skew calculation needs to first build a compactly parameterized model by applying the correlation information.

Given pre-measured spatially and temporally variant temperature maps, a temperature correlation matrix C can be extracted for merging points in each level. As shown in Section 2, such a correlation matrix usually shows quite similar and large entries, hence C is usually low-rank. Physically, it infers that the local temperature values within the same group of merging points change in the similar fashion. Therefore, if one can find the real-rank Kof C, it may guide us to cluster the merging pints into K groups. Singular Value Decomposition (SVD) [15] can be used to find the rank-K approximation,  $rank(K) = SVD(C, \epsilon_1)$ , for a low-rank matrix, where the rank(K) is obtained when the first K singular values of C are larger than  $\epsilon_1$ . The procedure of partitioning all merging points into K clusters is done by K-means clustering algorithm [16], where the correlation matrix is treated as the distance attribution matrix required by K-means clustering algorithm.

# 3.4 Worst-case Skew Calculation

The augmented system (16) has a larger size than the original one. To efficiently calculate the skew and the skew sensitivity with respect to perturbation, the augmented system can be reduced by projecting with a small sized and othronormalized matrix V( $\in \mathbb{R}^{N \times q}$ ,  $N \gg q$ ) [17, 14]. As it is a SIMO system, the reduction cost does not depend on the port number.

However, the projection matrix V in [17, 14] is flat. It can not separate the sensitivity from the nominal value during the projection. In this paper, instead of using the flat projection matrix V, a structured projection matrix is constructed similarly to [18, 19]

$$\mathcal{V} = diag[V_0, V_1, \cdots, V_K], \tag{20}$$

where V is partitioned according to the dimension of x and  $\delta x_i$ (i = 1, ..., K). Projecting by the new V, the reduced system becomes:

$$\widetilde{G}_{PC} = \mathcal{V}^T G_{PC} \mathcal{V} \quad \widetilde{C}_{PC} = \mathcal{V}^T C_{PC} \mathcal{V} \quad (\in \mathbb{R}^{q \times q})$$
$$\widetilde{B}_{PC} = \mathcal{V}^T B_{PC} \quad \widetilde{L}_{PC} = \mathcal{V}^T L_{PC}. \tag{21}$$

The time-domain voltage response of the dimension reduced macro-model can be calculated by the Backward-Euler method

$$(\widetilde{G}_{PC} + \frac{1}{h}\widetilde{C}_{PC})x_{PC}(t) = \frac{1}{h}\widetilde{G}_{PC}x_{PC}(t-h) + \widetilde{B}_{PC}u(t)$$
$$y_{PC}(t) = \widetilde{L}_{PC}^T x_{PC}(t).$$
(22)

Note that the reduced system matrix has a preserved block-triangular structure and can be efficiently solved by factorizing only the diagonal block ( $G_0$  and  $C_0$ ). Moreover, there is no need to construct a large augmented state matrix when the block matrix data structure is implemented.

Accordingly, the skew and skew sensitivity can be calculated separately

$$\widetilde{y}_{PC} = [\widetilde{y}, \delta \widetilde{y}_1, \cdots, \delta \widetilde{y}_K]^T, \qquad (23)$$

and the total voltage response in each sink under the perturbation configuration  $PC_j^i$  at level i  $(i = 1, ..., N_l, j = 1, ..., K)$  is

$$\widetilde{y}_{PC_i^i}(t) = \widetilde{y}^i(t) + \delta \widetilde{y}^i_j(t).$$
(24)

The algorithm performs in a bottom-up fashion by solving  $y^1, y^2, ..., y^{N_l}$  level by level sequentially.

Following the conventional definition for the propagation delay,  $D_{(Src \leftrightarrow s_i)}$ , from the source node Src to sinks  $s_i$ , is the time required for the node voltage (waveform) to pass 65% of the peak voltage under the impulse excitation in the source node. After obtaining the source to sink delay of *j*-th perturbation configuration  $PC_j^i$  in level *i*, we can calculate the worst case skew corresponding to  $PC_i^i$  as follows

$$Skew_i = \max_{\forall s_k} D_{(Src \rightsquigarrow s_k)} - \min_{\forall s_k} D_{(Src \rightsquigarrow s_k)}.$$
(25)

The worst-case skew is then determined by those preserved perturbations from all levels.

In summary, the flow of the overall PECO algorithm is as follows. After a DME-initialized clock tree construction, the re-embedding by perturbation is performed as follows. The worst-case skew and re-embedding are determined in a bottom-up fashion. At each level, the merging points are clustered according to their correlation strength. Then the resulting clusters are perturbed with five possible perturbation configurations, and those that could cause large skew changes (sensitivity) are selected for re-embedding. Due to the bounded perturbations, i.e., the perturbation radius is confined, as shown by experiments, the overhead of wire length is negligible.

#### 3.5 Integration with Cross-link Insertion

Cross-link insertion has been used to construct robust clock network under time-invariant variations [13]. To explore the impact for worst case clock skew caused by cross-link insertion, we extend cross-link insertion algorithm in [13] by considering high order delay model and simultaneously performing cross-link insertion and merging points perturbation. The integrated algorithm in called PCC. The overflow of PCC algorithm is as follows. The clock tree is initialized by DME algorithm. The candidate node pairs for link insertion are then identified according to  $\alpha$ ,  $\beta$  and  $\gamma$  rules [13]. After that the link capacitances are added to the selected nodes, and the merging point locations are tuned by PECO algorithm to restore original skew. Finally the link resistances are inserted in selected node pairs.

#### 4. EXPERIMENT RESULTS

We have implemented PECO algorithm and PCC algorithm. All experiments are conducted with benchmark r1-r5 in a Linux server with Intel Xeon 1.9GHz CPU and 2GB memory. The initial zero-skew tree is constructed by the DME method under the Elmore delay model with no temperature variation. Suggested by [4], we assume that the interconnect has unit resistance  $r_0 = 0.03\Omega/mum$ , unit capacitance  $c_0 = 2.0 \times 10^{-16} F/\mu m$  and temperature sensitivity  $\beta = 0.0068$ .

| Input(node) | wire-length $(um)$ |         |         |         | worst case skew $(100\text{-map})(ps)$ |        |       |       | $\operatorname{runtime}(s)$ |      |      |       |
|-------------|--------------------|---------|---------|---------|----------------------------------------|--------|-------|-------|-----------------------------|------|------|-------|
|             | DME                | [13]    | PECO    | PCC     | DME                                    | [13]   | PECO  | PCC   | DME                         | [13] | PECO | PCC   |
| r1(267)     | 1.32e06            | 1.42e06 | 1.37e06 | 1.38e06 | 144                                    | 43     | 108.7 | 6.7   | 0.5                         | 1.1  | 1.1  | 1.4   |
| r2(598)     | 2.56e06            | 2.68e06 | 2.61e06 | 2.63e06 | 534                                    | 79     | 161.3 | 16.8  | 1.0                         | 3.2  | 4.5  | 5.3   |
| r3(862)     | 3.38e06            | 3.53e06 | 3.44e06 | 3.47e06 | 587                                    | 121    | 198.7 | 162.5 | 1.4                         | 4.7  | 6.1  | 13.2  |
| r4(1903)    | 6.82e06            | 6.97e06 | 6.85e06 | 6.93e06 | 1428                                   | 429    | 136.5 | 140.7 | 2.1                         | 5.5  | 33.6 | 59.0  |
| r5(3101)    | 1.02e07            | 1.19e07 | 1.05e07 | 1.07e07 | 3533                                   | 1162.2 | 1321  | 933.8 | 6.2                         | 11.4 | 86.6 | 191.4 |

Table 1: Comparisons among various clock synthesis algorithms



Figure 1: Skew distribution produced by PCC and DME for r1 to r4 on 100 temperature maps.

A chip with size  $6cm^2$  is divided into a uniform grid with  $100 \times 100$  regions to obtain the distributed temperature map by a micro-architecture level cycle-accurate power/temperature simulator [12]. We collect 100 temperature maps by simulating six SPEC2000 applications (art, ammp, compress, equake, gcc, gzip) in a sequence and recording temperature maps for every 10 million clock cycles after fast forwarding of 1 billion cycles. These applications lead to a temperature variation about  $50^{\circ}C$  over the 100 temperature maps. We then find regions of hot spot with high temperature variation (variance), and extract the correlation matrix, i.e., covariance for pairwise regions. In this paper we report the worst skew by SPICE simulation for the above 100 temperature maps unless stated otherwise.

Table 1 compares our PECO and PCC algorithms to DME and cross-link Insertion algorithm [13]. PECO algorithm reduces the worst-case skew by up to 10X compared to DME. Comparing time-invariant variation aware cross-link insertion [13] and PECO, PECO has 13% larger worst-case skew while [13] increases wire length by 5% on average. By combining PECO and cross-link insertion, PCC reduces the worst-case skew by up to 20X compared to DME and up to 7X compared to the cross-link insertion from [13]. Note that both PECO and PCC have less than 1% wire length overhead compared to DME. Due to the employment of high order macro model in PECO and PCC, the runtimes of PECO and PCC are up to 15X and 20X compared to those for DME and [13], respectively, while it is still acceptable for optimizing a clock tree with over 3K sinks (r5). The skew distributions produced by DME and PCC for r1-r4 over 100 temperature maps are shown in Figure 1.

#### 5. CONCLUSIONS

Time-variant temperature gradient induced clock skew was not considered in the existing clock synthesis algorithms [1, 20, 3, 4]. In this paper, we propose a minimal skew clock tree embedding that considers the time-variant temperature gradient with spatial and temporal correlations. Parameterized perturbation is used to model the physical location changes of merging points. The experimental results show that our algorithm reduces worst-case skew by up to 10X compared to the existing zero-skew based approach DME [20, 3] with small (no more than 1%) wirelength overhead. In addition, by integrating the cross-link insertion into our parameterized perturbation framework, the worst-case skew is reduced by up to 20X compared to DME and up to 7X compared to the cross-link insertion algorithm considering time-invariant variations [13].

#### 6. **REFERENCES**

- R.-S. Tsay, "An exact zero skew clock routing algorithm," TCAD, pp. 242–249, 1993.
- [2] T.-H. Chao, Y.-C. H. Hsu, J.-M. Ho, K. D. Boese, and A. B. Kahng, "Zero skew clock routing with minimum wirelength," TCAS, vol. 39, pp. 799–814, Nov. 1992.
- [3] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao,
- "Bounded-skew clock and Steiner routing," TODAES, 1997.
  [4] M. Cho, S. Ahmed, and D. Z. Pan, "TACO: Temperature aware clock-tree optimization," in *ICCAD*, 2005.
- [5] J. Tsai, L. Zhang, and C. C. Chen, "Statistical timing analysis driven post-silicon-tunable clock-tree synthesis," in *ICCAD*, 2005.
- [6] U. Padmanabhan, J. M. Wang, and J. Hu, "Statistical clock tree routing for robustness to process variations," in *ISPD*, 2006.
- [7] C. C. Teng, Y. K. Cheng, E. Rosenbaum, and S. M. Kang, "iTEM: A temperature-dependent electromigration reliability diagnosis tool," *TCAD*, pp. 882–893, 1997.
- [8] K.Banerjee, A.Mehrotra, A.Sangiovanni-Vincentelli, and C. Hu, "On thermal effects in deep sub-micron VLSI interconnects," in DAC, 1999.
- [9] T.-Y. Chiang, B. Shieh, and K. Sarawat, "Impact of Joule heating on scaling of deep sub-micron Cu/low-k interconnects," in *ISVLSI*, 2002.
- [10] A. H. Ajami, K. Banerjee, and M. Pedram, "Modeling and analysis of nonuniform substrate temperature effects on global ULSI interconnects," TCAD, pp. 849–861, 2005.
- [11] K. Skadron, M. Stan, and et. al., "Temperature-aware microarchitecture," in Proc. Intl. Symp. on Computer Architecture, 2006.
- [12] W. Liao, L. He, and K. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," *TCAD*, pp. 1042–1053, 2005.
- [13] A. Rajaram, J. Hu, and R. Mahapatra, "Reducing clock skew variability via crosslinks," in *ICCAD*, 2004.
- [14] X. Li, P. Li, and L. Pileggi, "Parameterized interconnect order reduction with explicit-and-implicit multi-parameter moment matching for inter/intra-die variations," in *ICCAD*, 2005.
- [15] G. H. Golub and C. F. V. Loan, *Matrix Computations*. Baltimore, MD: The Johns Hopkins University Press, 3 ed., 1989.
- [16] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in *Proceedings of 5-th Berkeley* Symposium on Mathematical Statistics and Probability, 1967.
- [17] A. Odabasioglu, M. Celik, and L. Pileggi, "PRIMA: Passive reduced-order interconnect macro-modeling algorithm," *TCAD*, pp. 645–654, 1998.
- [18] H. Yu, L. He, and S. Tan, "Block structure preserving model reduction," in BMAS, 2005.
- [19] H. Yu, Y. Shi, and L. He, "Fast analysis of structured power grid by triangularization based structure preserving model order reduction," in DAC, 2006.
- [20] T. H. Chao, Y-C.Hsu, J-M.Ho, K.D.Boese, and A. Kahng, "Zero skew clock routing with minimum wirelength," *TCAS II*, pp. 799–814, 1992.