### Minimal Skew Clock Embedding Considering Time Variant **Temperature Gradient** Hao Yu, Yu Hu, Chunchen Liu, and Lei He **Electrical Engineering Department** University of California Los Angeles, CA 90095 hy255@ee.ucal.edu, hu@ee.ucla.edu, chunta@ee.ucla.edu, lhe@ee.ucla.edu #### **ABSTRACT** The existing temperature-aware clock embedding assumes a time-invariant temperature gradient. However, it is not solved how to find the worst-case temperature gradient leading to the worst case skew. In this paper, we develop a PErturbation based Clock Optimization (PECO) considering the timevariant temperature gradient. For a given clock topology, we minimize the worst case skew without asking for the worst case temperature map. We decide the merging point level by level based on the sensitivity of the skew with respect to the change of merging point. Such sensitivity is calculated using a parameterized model, which is compressed by a singularvalue-decomposition (SVD) and K-means based clustering considering the temperature correlation. The experimental results show that our algorithm reduces worst-case skew by up to 5X compared to the existing zero skew based ZST/DME method with small (up to 1%) wirelength overhead. Categories and Subject Descriptors: B.7.2 [Hardware]: Integrated circuits - Design aids General Terms: Algorithms, design Keywords: Clock Tree Design, Thermal Management, Parameterized Perturbation, Compact Parameterization #### 1. INTRODUCTION Clock synthesis is important to design high-performance VLSI circuits [1, 2, 3, 4, 5, 6]. Given a set of sinks (flip-flops), the clock routing is to find a topology and embedding with the minimized mismatch of arrival times, called skew, and the wire length. An exact zero-skew algorithm is developed in [1]. To reduce wire length, a deferred merge embedding (DME) [2] method is introduced by maintaining a set of merging points. Moreover, instead of using a zero skew, a bounded skew tree (BST) algorithm is proposed in [3] to minimize the wirelength under a non-zero skew bound. All these methods [1, 2, 3] minimize the skew in the nominal Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISPD'07, March 18–21, 2007, Austin, Texas, USA. Copyright 2007 ACM 978-1-59593-613-4/07/0003 ...\$5.00. case of process, supply voltage, and temperature (PVT). Due to the variation in PVT, however, there exist extra clock skews. This paper focuses on temperature but can be extended to handle the skew caused by variations from the process and supply voltage. The exponential advance of VLSI technology has resulted in a high and non-uniform current density and power dissipation over the chip [7, 8, 9, 10], which leads to a temperature gradient, i.e., spatial temperature variation over a chip that can cause non-negligible delay differences in both interconnects and devices. As clock is globally routed over the chip, the temperature gradient can bring significant skew variations [10, 4]. Therefore, the traditional constant zero/bounded skew routing methods [1, 2, 3 become invalid since the temperature can vary spatially. Accordingly, the balanced skew points or called merging points can be shifted with presence of the temperature gradient. As a result, it requires a clock synthesis to find and minimize the worst-case skew considering the temperature gradient. Considering the skew caused by the spatial variation of temperature, TACO [4] proposes a re-embedding of clock tree. This work assumes a time-invariant worst-case temperature map, then minimize the skew by a merging diamond. Due to the increasing use of dynamic power managements, the (thermal) power density is time-variant [11] or depends on the workload. Figure 1 shows the temperature maps by a micro-architecture level power and temperature co-simulation [12] for SPEC2000 benchmark applications ammp and gzip. Clearly, the temperature maps are quite different under different applications. If the skew is optimized based on the time-invariant temperature map for the application ammp, the resulting embedding is clearly not optimized and leads to a large skew for the application qzip. The first contribution of our paper is to point out that to minimize the worst-case skew for microprocessors we need to consider the time-variant temperature gradient but not the static gradient as in [4]. However, it is expensive, if not impossible to find the worst-case skew by exhaustive search of time-variant temperature maps on the scale of thermal time-constant. Because different instructions or applications usually share some common resources (such as L2-cache) systematically controlled by designers, power and therefore temperature are correlated. Figure 2 shows the correlation between regionbased temperature maps for six selected SPEC2000 benchmark applications with details discussed in Section 2. Although these applications have totally different temperature maps, there exists a strong spatial correlation between tem- <sup>\*</sup>This paper is partially supported by NSF CAREER award CCR-0093273 and UC-MICRO fund from Intel. Address comments to lhe@ee.ucla.edu. Figure 1: The temperature gradient over the chip after 10 million cycles. (a) is for application:ammp and (b) is under application:gzip. peratures for different chip regions. At the first glance, the correlated temperature complicates the worst-case skew calculation, but in fact such correlation enables us to cluster temperature maps and to calculate the worst case skew more efficiently based on the clustered temperature maps. The second contribution of this paper is to develop an efficient algorithm, PErturbation based Clock Optimization (PECO) <sup>1</sup> leveraging the clustering idea to find a clock embedding with the minimal worst case skew. PECO introduces the parameterized perturbation to model layout changes of merging-points in state matrices. One temperature correlation matrix is extracted under a sequence of temperature maps for different phases of an application or for different applications (the workload) for one chip. The correlation is used to prune redundant perturbations. The skew and its sensitivity are accurately and efficiently calculated from a structured and parameterized macromodel. The sensitivity is then used to prune insignificant solutions during the search of the worst-case skew level by level. In contrast, TACO [4] searches the merging point geometrically along the direction of the temperature gradient. It is true that the larger the temperature varies, the bigger the resistance changes and so does the delay. But it does not guarantee that the skew ought to change similarly as skew is a delay difference. As a result, the approach in [4] can be still inaccurate to determine and minimize the worst-case skew even if the worst-case temperature map is given. Our formulation is quite different from the worst-case temperature-map formulation used in TACO [4], which has not solved how to find the worst-case temperature map and an arbitrary temperature map may lead to an arbitrarily large clock skew for another temperature map. Therefore, only ZST/DME [2, 3] is used for the comparison. The experimental results show that our PECO algorithm reduces the worst case skew by up to 5X compared to ZST/DME with up to 1% wirelength overhead. The rest of the paper is organized as follows. We present Figure 2: The spatial correlation map under a given sequential applications. a stochastic temperature model in Section 2. We formulate the PErturbation based Clock Optimization (PECO) problem in Section 3. We introduce a compactly parameterized description for merging point perturbations in Section 4, and develop an efficient worst-case skew calculation in Section 5. We present the overall PECO algorithm and results in Section 6 and conclude the paper in Section 7. ## 2. TEMPERATURE CORRELATION EXTRACTION We first extract the temperature gradient and correlation by a micro-architecture level power and temperature simulation [12], where the alpha architecture is used as an example. The overall chip is discretized into a uniform grid with total N nodes. By applying six different applications (ammp, art, compress, equake, gzip, gcc) from SPEC2000 benchmark in a sequence (each with a time-period $t_p$ ), the thermal-power is obtained by averaging the cycle-accurate (scale of ps) dynamic power in the thermal-constant scale (scale of ms). Using this time-variant thermal power as input, the transient-temperature $T(t_i, n_j)$ over the chip is calculated at different time-instant $t_i$ for each node $n_i$ in the grid. Figure 1 shows the temperature maps for two different applications at selected time instants, where (a) is for application: ammp at 11-million cycles and (b) is for application:gzip at 15-million cycles. Clearly, their temperatures are quite different and hence each can lead to a significantly different skew variation. However, it is computationally expensive to determinate the worst-case skew from multiple temperature maps at different time-instants. Accordingly, the resulting location change of the mering point can have a huge number of combinations. Because of the resource sharing, the temperature maps at different regions actually have a high spatial correlation. In other words, it is quite possible that changes of temperature at some regions could be similar. Therefore, if we start with a balanced tree under the nominal temperature, the similar temperature gradient could lead to a similar change of resistance in two subtrees, and hence a similar location change of the merging point. In the following, we propose an automatic correlation extraction for the temperature. The temperatures at N nodes are modeled by random processes. Each node is described <sup>&</sup>lt;sup>1</sup>It also represents Parameterized Engineering Change Orders, as re-embedding is a form of ECO (Engineering Change Order). | Variable | Physical meaning | |--------------------|--------------------------------------------------| | $T(t_i, n_j)$ | transient-temperature | | $\mathcal{C}(i,j)$ | correlation coefficient between node $i$ and $j$ | | $n_{src}$ | source node | | $M^i$ | set of merging points at level $i$ | | $M'^i$ | perturbed merging points at level i | | $P^i$ | path from $M$ to sink | | PC | perturbation configuration | | $d(M'^i, s_k)$ | embedding path | | $R(M',s_i)$ | new resistance after merging | | $G_0, C_0$ | conductive and capacitive matrices | | x,y | voltage state variable/output | | $S_i$ , $St_i$ | sinks and Steiner points | | B/L | input/output port matrix | | $\delta G_i$ | conductive change by $PC^i$ | | V | projection matrix | | h | time step during Backward-Euler | Table 1: Index table for definitions and terms. by a temperature sequence sampled at $\mathcal{N}$ time-instants, $$\begin{split} \mathcal{S}_{n_1} &= \{T(t_1,n_1),...,T(t_p,n_1),T(t_{p+1},n_1),...,T(t_{\mathcal{N}},n_1)\}\\ \mathcal{S}_{n_2} &= \{T(t_1,n_2),...,T(t_p,n_2),T(t_{p+1},n_2),...,T(t_{\mathcal{N}},n_2),...\}\\ ...\\ \mathcal{S}_{n_N} &= \{T(t_1,n_N),...,T(t_p,n_N),T(t_{p+1},n_N),...,T(t_{\mathcal{N}},n_N)\} \end{split}$$ a temperature (spatial) correlation matrix is defined by $$C(i,j) = \frac{cov(i,j)}{\sigma_i \cdot \sigma_j},\tag{1}$$ where $$cov(i,j) = \sum_{k=1}^{N} T(t_k, n_i) T(t_k, n_j) - \sum_{k=1}^{N} T(t_k, n_i) \sum_{k=1}^{N} T(t_k, n_j) \quad (2)$$ is the co-variance matrix between nodes, $$\sigma_{i} = \sqrt{\sum_{k=1}^{N} T(t_{k}, n_{i})^{2} / \mathcal{N} - (\hat{T}_{i})^{2}}, \quad \sigma_{j} = \sqrt{\sum_{k=1}^{N} T(t_{k}, n_{j})^{2} / \mathcal{N} - (\hat{T}_{j})^{2}}$$ (3) are the standard deviations for nodes $n_i$ and $n_j$ , and $$\hat{T}_i = \sum_{k=1}^{\mathcal{N}} T(t_k, n_i) / \mathcal{N}, \quad \hat{T}_j = \sum_{k=1}^{\mathcal{N}} T(t_k, n_j) / \mathcal{N}.$$ (4) are mean-temperature for nodes $n_i$ and $n_j$ . The correlation coefficients C(i,j) can be precomputed and stored into a table. Figure 2 shows the distribution of one calculated correlation matrix. The averaged correlation is about 0.8. Accordingly, instead of using the fixed temperature map in [4], the temperature model, called the stochastic temperature map is used in this paper. The stochastic temperature map includes the average temperature, temperature variance for each chip region, and the spatial correlation between variances is considered for different regions. Note that there is only one stochastic temperature map extracted for the given sequence of applications. In Section 4, we will utilize the correlation information to cluster those regions with similar temperature changes and hence the similar location changes of merging points. As a result, the solution space to determine the worst-case skew is compacted by pruning those redundant solutions. Therefore, it becomes feasible and efficient to determine the the worst-case skew and hence the locations of merging points. Table 1 presents all definitions and terms used in this paper. # 3. PERTURBATION BASED CLOCK OPTIMIZATION (PECO) To robustly embed a clock tree with the minimal skew introduced by the time-variant temperature gradient, our starting point is an initial balanced skew clock tree obtained by ZST/DME [2, 3] method under a uniform temperature map. Because the real-time temperature only shows the variation in a bounded range, its impact to the change of physical location for the merging point is modeled by the perturbation in this paper. Similar to TACO [4], we assume that the merging point (or the balanced skew point) M of the initial tree is locally adjusted level by level in a bottom-up fashion according to the abstract topology of the clock tree with the bottom one at level-0. In each level i (except for the bottom one), the perturbed merging point $M'^i$ of node pair $(s_1, s_2)$ is assumed to locate in a bounded region centered at the initial merging points $M^i$ of this node pair. A re-embedding after perturbation is a pair of embedded paths $P_1^i$ and $P_2^i$ for each $M'^i$ to the node pair $(s_1, s_2)$ . As shown in Figure 3, to avoid the overhead of wirelength, the perturbations of one merging point at level i are assumed to happen along four Manhattan directions (North, South, West, East) from $M^i$ . But our method can be extended to consider more perturbed directions. Note that it is possible that the merging point $M^i$ will not change its physical location and hence stay at the origin. Therefore, there are total five perturbations assumed for one merging point. Moreover, because our re-embedding is performed level by level, the perturbation at $M_i$ is determined by the local temperature gradient in direction-k at one time instant $t_j$ . As a result, the perturbed $M'_i$ is confined by a time-variant radius $r^{k}(M^{i}, t_{j})$ $(k = 1, 2, 3, 4, j = 1, ..., \mathcal{N}),$ which can be approximated by the mean value over the Ntime instants and four directions. In addition, to further decide the embedding path for one perturbed merging point $M'^i$ , we need to find one path that has the smallest temperature gradient among all possible paths $M'^i \leadsto s_k$ . This path could be approximated by a shortest path determined as follows. Each sub-region j is given a weight of temperature standard deviation $std_j$ (j = 1, ..., N). Then the desired path is a shortest one with $$d(M^{\prime i}, s_k) = \min_{\forall P \in M^{\prime i} \leadsto s_k} \sum_{\forall q \in P} std(j)^2.$$ (5) I.e., the shortest path has the minimal sum of standard deviations along the path from the subtree root to one merging point. Accordingly, we formulate a problem for the clock tree embedding as follows FORMULATION 1. Given the source Src, $sinks s_1 \cdots s_n$ , initial topology of one clock tree (with embedding) by ZST/DME, and one stochastic temperature map for a sequence of applications, our perturbation based algorithm is to find a reembedded clock tree to minimize the worse-case skew and wire length considering the temperature gradient and correlation. We call this algorithm as PErturbation based Clock Optimization (PECO). A pseudocode of this algorithm will be presented in Subsection 6.1. The overall flow to solve this problem is briefly explained as follows. In the bottom-up phase of our algorithm, we first generate all perturbation Figure 3: Four Manhattan perturbation patterns for one merging point. The merging point could stay at origin. configurations in level i. As aforementioned, for each merging point, a perturbation configuration includes five candidates. We then calculate the sensitivity of the worst skew with respect to perturbation configurations in the current level. Finally, we select those perturbations that potentially minimizes the worst case skew and propagate them to the next level. Instead of calculating the skew by searching along the temperature gradient under the fixed temperature map [4], we propose to calculate the sensitivity of the worst case skew with respect to one perturbation configuration. Note that the temperature gradient only determines the change of resistance but not the change of skew, the strategy in [4] may not find the worst-case skew. On the other hand, the sensitivity of the skew can be used to capture the accurate relationship between the worst case skew and the perturbation configuration. Furthermore, we propose two effective pruning methods to efficiently yet accurately verify the worst-case skew. Note that a brute-force search of all different perturbation configurations at level i leads to $I=5^{N_i}$ combinations at for $N_i$ nodes at one time instant. As shown in Subsection 4.2, we employ the information of the spatial correlation to compress those redundant perturbation configurations. In addition, as shown in Subsection 5.1, we prune those insignificant perturbation configurations that lead to small skew changes (sensitivities) at output. Note that the key to solve the PECO problem depends on an efficient and accurate calculation of skew and sensitivity. In the following, we first build a compactly parameterized model for the clock tree, and then solve the skew and its sensitivity with respect to the perturbation by a structured and parameterized macromodel. #### 4. PARAMETERIZED PERTURBATION To calculate the worst-case skew, we first build an electrical circuit model for the clock tree. Each segmented clock wire is represented by a distributed $\pi$ -circuit with the capacitor in node and resistor in edge. The node capacitor includes both the wire and loading capacitance. #### 4.1 Parameterized System Equation Given the initial tree $\mathcal{T}$ with $N_l$ levels and N nodes, its system equation can be described by the Modified Nodal Analysis (MNA) in frequency-domain $$(G_0 + sC_0)x(s) = Bu (6)$$ $$y(s) = L^T x(s) \tag{7}$$ where $G_0$ and $C_0$ ( $\in \mathbb{R}^{N \times N}$ ) are the conductive and capacitive matrices for an initial tree, x(s) is the state variable for voltage, and y(s) is the output voltage. Note that x includes n sinks and m Steiner points, which can be represented by $$x = [x_{s_1}, \cdots, x_{s_n}, x_{Src}, x_{st_1}, \cdots, x_{st_m}]^T$$ (8) where $s_i, i = 1, \dots, n$ are sinks and $st_i, i = 1, \dots, m$ are Steiner points. Since the source node in the tree is the only input of the system, the input port matrix $B \ (\in \mathbb{R}^{N \times 1})$ becomes $$B = [\underbrace{0, \cdots, 0}_{n}, 1, \underbrace{0, \cdots, 0}_{m}]^{T}. \tag{9}$$ It selects the input with source u(s). In this paper, we assume an impulse input at the source node. In addition, because only the voltage responses at sinks are interested, the output port matrix L at level $i \in \mathbb{R}^{N \times n}$ ) becomes $$L^{i} = \begin{bmatrix} I_{n \times n} \\ 0_{(N-n) \times n} \end{bmatrix} \tag{10}$$ The according selected output y is $$y = [y_{s_1}, \cdots, y_{s_n}].$$ (11) It is in fact a SIMO (single-input-multi-output) system. When a clock wire experiences a temperature gradient, the unit-length resistance $r_{unit}$ is [10] $$r_{unit}(x, y, t) = \rho_0 \cdot [1 + \beta \cdot T(x, y, t)]. \tag{12}$$ where $\rho_0$ is the unit-length resistance at $0^{\circ}C$ , and $\beta$ is the temperature coefficient of resistance $(1/{}^{\circ}C)$ . When the embedding path $d(M'^{i}, s_k)$ is fixed, we calculate the new resistance by $$R(M^{\prime i}, s_k) = \sum_{\forall e \in d(M^{\prime i}, s_k)} E[r_{unit}(e)] \cdot len(e)$$ (13) where $E[r_{unit}(x, y)]$ is the mean value of resistance in edge $e(M^{\prime i}, s_k)$ The impact of temperature gradient is modeled by a perturbation that modifies the location of the merging point. As a result, there is a correspondent change in the conductance matrix G. Therefore, a perturbed MNA under all perturbation configurations becomes $$[(G_0 + \delta G_1) + s(C_0 + \delta C_1)] \cdot (x + \delta x_1) = Bu$$ $$[(G_0 + \delta G_2) + s(C_0 + \delta C_2)] \cdot (x + \delta x_2) = Bu$$ ... $$[(G_0 + \delta G_I) + s(C_0 + \delta C_I)] \cdot (x + \delta x_I) = Bu.$$ (14) Similar to the approach in [13], by organizing the expanded terms in the order of perturbations, and defining a new state variable $$x_P = [x, \delta x_1, \cdots, \delta x_I]^T \tag{15}$$ composed by nominal values and first-order perturbations, an augmented system equation could be obtained $$(G_P + sC_P)x_P = B_P u, \quad y_P = L_P^T x_P, \tag{16}$$ where $$G_{P} = \begin{bmatrix} G_{0} & 0 & \dots & 0 & 0 & 0 \\ \delta G_{1} & G_{0} & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ \delta G_{i} & 0 & \dots & G_{0} & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \delta G_{I} & 0 & 0 & \dots & 0 & G_{0} \end{bmatrix}$$ (17) $$B_P = [B, \overbrace{0, \cdots, 0}^{I-1}]^T \tag{18}$$ and $$L_P = Diag(\overbrace{L_0, \cdots, L_0}^I). \tag{19}$$ Note that $C_P$ has a similar structure as $G_P$ . In addition, the output is $$y_P = [y, \delta y_1, \cdots, \delta y_I]^T, \tag{20}$$ where each $\delta y_i$ is a length-n vector, representing the voltage response and changes for each sink. As shown in Section 5, by solving the above system equation (16) in time-domain, the worst-case skew at each level can be calculated from the voltage response y(t). Figure 4: Clustering and perturbations within each cluster. #### 4.2 Parameter Clustering The parameterized representation in Section 3 is not compact, and it is impossible to count every possible perturbation configurations spatially and temporally. For example, a typical level in clock tree contains 200 merging points. If each merging point has 5 potential perturbations (move along North, South, East, West, Origin), there could be $I=5^{200}$ combinations parameterized perturbations at one time instant. On the other hand, because there exists temperature correlation at different regions, many merging points could have a similar perturbation configuration (See Figure 4). Therefore, an efficient worst-case skew calculation needs to first build a compactly parameterized model by applying the correlation information. Given pre-measured spatially and temporally variant temperature maps, one temperature correlation matrix $\mathcal{C}$ can be extracted for merging points in one level. As shown in Section 2, such a correlation matrix usually shows quite similar and large entries, which hence is usually low-rank. Physically, it infers that the local temperature values within the same group of merging points change in the similar fashion. Because we start from an initial balanced tree, the similar change in temperature as input can lead to the similar location change of those merging points. Therefore, if one can find the real-rank K of C, it may guide us to cluster the merging pints into K groups. It is well-known that Singular Value Decomposition (SVD) [14] can be used to find the rank-K approximation $$rank(K) = SVD(C, \epsilon_1)$$ (21) for a low-rank matrix, where $\epsilon_1$ is a threshold value to prune those smaller singular values. Therefore, the rank(K) is obtained when the first K singular values of $\mathcal C$ are larger than $\epsilon_1$ . As a result, we can infer that K subgroups of elements show a similar correlation. This can be performed as follows. A correlation graph is first constructed by assigning one edge to each pair of merging points $s_i, s_j$ with the weight $\mathcal{C}(i, j)$ . Then, a **K-means** partition [15] is performed to cluster the merging points at one level into K clusters $$[M_{c1}, M_{c2}..., M_{cK}] = K\text{-}means(K, C)$$ (22) where $\{c1, c2, ..., cK\}$ are K centroids of those K clusters to typically represent the perturbation behavior in each group. Accordingly, the compactly parameterized system equation becomes $$(G_{PC} + sC_{PC})x_{PC} = B_{PC}u, \quad y_{PC} = L_{PC}^T x_{PC},$$ (23) where $$G_{PC} = \begin{bmatrix} G_0 & 0 & \dots & 0 & 0 & 0 \\ \delta G_{c1} & G_0 & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ \delta G_{ck} & 0 & \dots & G_0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \delta G_{cK} & 0 & 0 & \dots & 0 & G_0 \end{bmatrix}$$ (24) where the system size is roughly compressed from $(I+1)N \times (I+1)N$ to $(K+1)N \times (K+1)N$ (N>>K). As a result, lots of redundant perturbations in the search space are initially pruned. #### 5. WORST-CASE SKEW VIA MACROMODEL The augmented system has a larger size than the original. To efficiently calculate the skew and the skew sensitivity with respect to perturbation, the augmented system can be reduced by projecting with a small sized and othronormalizled matrix $V \ (\in R^{N \times q}, \ N >> q)$ [16, 13]. As it is a SIMO system, the reduction cost does not depend on the port number. #### **5.1** Sensitivity Based Pruning However, the projection matrix V in [16, 13] is flat. It can not separate the sensitivity from the nominal value during the projection. In this paper, instead of using the flat projection matrix V, a structured projection matrix is constructed similarly to [17, 18] $$\mathcal{V} = diag[V_0, V_1, \cdots, V_K], \tag{25}$$ where V is partitioned according to the dimension of x and $\delta x_i$ (i = 1, ..., K). Projecting by the new V, the reduced system becomes: $$\widetilde{G}_{PC} = \mathcal{V}^T G_{PC} \mathcal{V} \quad \widetilde{C}_{PC} = \mathcal{V}^T C_{PC} \mathcal{V} \quad (\in \mathbb{R}^{q \times q})$$ $$\widetilde{B}_{PC} = \mathcal{V}^T B_{PC} \quad \widetilde{L}_{PC} = \mathcal{V}^T L_{PC}. \tag{26}$$ The time-domain voltage response of the dimension reduced macromodel can be calculated by the Backward-Euler method $$(\widetilde{G}_{PC} + \frac{1}{h}\widetilde{C}_{PC})x_{PC}(t) = \frac{1}{h}\widetilde{G}_{PC}x_{PC}(t-h) + \widetilde{B}_{PC}u(t)$$ $$y_{PC}(t) = \widetilde{L}_{PC}^{T}x_{PC}(t). \tag{27}$$ Note that the reduced system matrix has a preserved blocktriangular structure and can be efficiently solved by factorizing only the diagonal block ( $G_0$ and $G_0$ ). Moreover, there is no need to construct a large augmented state matrix when the block matrix data structure is implemented. Accordingly, the skew and skew sensitivity can be calculated separately $$\widetilde{y}_{PC} = [\widetilde{y}, \delta \widetilde{y}_1, \cdots, \delta \widetilde{y}_K]^T, \tag{28}$$ and the total voltage response in each sink under the perturbation configuration $PC_j^i$ at level i $(i = 1, ..., N_l, j = 1, ..., K)$ is $$\widetilde{y}_{PC_{i}^{i}}(t) = \widetilde{y}^{i}(t) + \delta \widetilde{y}_{j}^{i}(t).$$ (29) The algorithm performs in a bottom-up fashion by solving $y^1, y^2, ..., y^{N_l}$ level by level sequentially. Note that the computational cost will grow exponentially if we preserve all perturbation configurations during the update. To reduce the computational cost, we further prune those insignificant perturbations to change the skew when updating from level i to level i+1. This is the second pruning in addition to the one by SVD compression in Section 4.2. It removes one perturbation that could result in an insignificant levelized delay change $(\widetilde{y}_j^i)_{t_0 \to t_p}$ compared to the nominal delay $(\widetilde{y})_{t_0 \to t_p}$ by $$\frac{|\delta \widetilde{y}_j^i|_{t_0 \to t_p}}{|\widetilde{y}^i|_{t_0 \to t_p}} < \epsilon_2. \tag{30}$$ $\epsilon_2$ is specified by user and $t_0$ , $t_p$ define the period of the evaluation. Therefore, only those significant perturbations $\delta G^i_j$ are propagated to the level i+1. As a result, the overall computational cost could be further reduced. #### 5.2 Worst-case Skew Following the conventional definition for the propagation delay, $D_{(Src \leadsto s_i)}$ , from the source node Src to sinks $s_i$ , is the time required for the node voltage (waveform) to pass 65% of the peak voltage under the impulse excitation in the source node. After obtaining the source to sink delay of j-th perturbation configuration $PC^i_j$ in level i, we can calculate the worst case skew corresponding to $PC^i_j$ as follows $$Skew_i = \max_{\forall s_k} D_{(Src \leadsto s_k)} - \min_{\forall s_k} D_{(Src \leadsto s_k)}.$$ (31) The worst-case skew is then determined by those preserved perturbations from all levels. #### Algorithm 1 PECO Algorithm ``` Input: source, sinks, initial tree \mathcal{T} and correlation \mathcal{C} Output: A re-embedded clock tree T' 1: (N_l \text{ Levels}) \leftarrow \text{Levelize } \mathcal{T} {Bottom up embedding from the second last level to the sec- ond level} for i = N_l - 1 to 1 do PC^i \leftarrow Perturbation configurations in level i 3: for \forall p_i^i \in PC^i do R_i^i, C_i^i \leftarrow \text{Extract R/C of the updated tree} 5: 6: Obtain parameterized state description (16) 7: Prune-1: Remove redundant perturbation configurations by (21) and (22) x_{s_k}, \forallsink s_k \leftarrow Solve the system (x \text{ and } \delta x_i) based on (27) 8: 9: Prune-2: Remove insignificant perturbation configura- D_{Src \leadsto s_k}, \forall \text{sink } s_k \leftarrow \text{Calculate source-sink delay} 10: 11: Calculate worse-case skew under PC^i by (31) 12: Embed in level i \leftarrow (PC'^i = \min_{\forall PC_i^i} Skew_j) 13: 14: end for 15: return T' ``` Figure 5: Singular-value distribution of the correlation matrix. x-axis is the order of the singular values. #### 6. ALGORITHM AND RESULTS #### 6.1 Overall Algorithm In summary, the flow of the overall PECO algorithm is presented in Algorithm 1. After a DME-initialized clock tree construction, the re-embedding by perturbation is performed as follows. The worst-case skew and re-embedding are determined in a bottom-up fashion. At each level, the merging points are clustered according to their correlation strength. Then the resulting clusters are perturbed with five possible perturbation configurations, and those that could cause large skew changes (sensitivity) are selected for re-embedding. Due to the bounded perturbations, i.e., the perturbation radius is confined, as shown by experiments, the overhead of wirelength is negligible. #### **6.2** Experimental Results The proposed PECO algorithm, including the parameter compression and the structured and parameterized reduction, are implemented in C++ and Matlab. All experimental data is measured on a Linux server with 1.9GHz P4-CPU Figure 6: Given 100 temperature maps, compare skew distribution of PECO on benchmarks r1 to r4 with ZST/DME. Note that the scale for skew is non-uniform. and 2Gb memory. ZST/DME [2] is used for the comparison. A chip with size $6cm^2$ is divided into a uniform grid with $50 \times 50$ nodes to sample the non-uniformly distributed temperature map. The temperature maps at nodes of the grid are obtained from a micro-architecture level power and temperature transient simulator [12] by applying six SPEC2000 applications (art, ammp, compress, equake, gcc, gzip). The macromodel (6th-order) is used to generate both the transient voltage response, and then to calculate the worst-case skew and its sensitivities under perturbations. All simulations use the unit resistance $r_0 = 0.03\Omega/mum$ , unit capacitance $c_0 = 2.0 \times 10^{-16} F/mum$ and $\beta = 0.0068$ [4], where $\beta$ is the temperature coefficient of resistance. The clock-tree r1-r5 in [3] are used as the benchmark. The initial tree is constructed by the ZST/DME method. The above six applications lead to a temperature variation about $50^{\circ}C$ . We collect 100 temperature maps by simulating these applications in a sequence and recording temperature maps for every 10 million clock cycles. One correlation matrix is then extracted. After obtaining the searching radius, we generate perturbation configurations for the 100 temperature maps. To compress the redundant parameterized perturbations, SVD is applied to explore the rank of correlation matrix C. Figure 5 shows a distribution of singular values. Clearly, the dominant singular values of $\mathcal{C}$ decay sharply in a deceasing order. Thus a low rank representation of perturbed merging points could yield a good approximation. In this case, we choose a rank-5 approximation and apply 5-means partition to compress the parameterized representation. In Table 2, we report the worst-case skew produced by ZST/DME in both Elmore model and high-order macromodel as used by PECO. PECO minimizes the worst-case skews for all r1-r5 circuits, and reduces the worst-case skew by up to 5X (0.65ns vs. 2.73ns) compared to ZST/DME. | Ckt | worst skew(ps) | | | wirelength(um) | | |----------|----------------|-----------|-----------|----------------|---------| | (sink#) | DME | | PECO | DME | PECO | | | Elmore | 6th-order | 6th-order | | | | r1(267) | 50.5 | 52.3 | 38.0 | 1.30e06 | 1.32e06 | | r2(598) | 176.0 | 180.5 | 52.8 | 2.59e06 | 2.60e06 | | r3(862) | 285.2 | 290.1 | 94.4 | 3.37e06 | 3.38e06 | | r4(903) | 620.2 | 652.8 | 206.5 | 6.81e06 | 6.82e06 | | r5(3101) | 2730 | 3140 | 650 | 1.01e07 | 1.02e07 | Table 2: The worst-case skew and wirelength comparison for PECO and DME on benchmarks r1 to r5. | Ckt | runtime (s) | | | | | |-------------|-------------|-------------|----------|--|--| | $(\sinh\#)$ | DME | PECO | | | | | | | build-model | opt-skew | | | | r1 (267) | 0.5 | 1.1 | 0.2 | | | | r2 (598) | 1.0 | 6.9 | 0.6 | | | | r3 (862) | 1.2 | 16.9 | 1.3 | | | | r4 (903) | 4.1 | 43.0 | 1.1 | | | | r5 (3101) | 7.2 | 164.4 | 1.4 | | | Table 3: Runtime comparison for PECO and DME on benchmarks r1 to r5. Note that we calculate the worst-case skew still using the 100 temperature maps. We then plot the calculated skew distributions using 100 temperature maps in Figure 6. All clock trees optimized by PECO have smaller worst-case skew than ZST/DME does, where the reported skew is based on the high-order macromodel used in PECO. The averaged skew reduction by PECO is 3.5X, and the skew distribution is also narrower. Moreover, Figure 7 compares the initial clock tree and the re-embedded one for r3. Table 2 also compares the wirelength between PECO and ZST/DME. Because of re-embedding the overall wirelength of PECO is larger than ZST/DME. But due to a bounded perturbation, the overhead is less than 1% on average. The runtime of PECO is reported in Table 3. The runtime includes the time to build macromodel and the one to find optimized merging point. Due to the macromodel construction time, its overall runtime is larger than ZST/DME when Elmore model is used in ZST/DME. Note that the construction of macromodel is one-time and the reduced macromodel can be reused under different inputs. #### 7. CONCLUSIONS Existing clock tree synthesis [1, 2, 3] did not consider the extra skew caused by the temperature gradient, or [4] needed to assume a time-invariant worst-case temperature map to find the worst-case skew. In this paper, we propose a minimal skew clock tree embedding that considers the time-variant temperature gradient with correlation. Parameterized perturbation are used to model the physical location changes of merging points. We decide the merging point level by level based on the sensitivity of the skew with respect to the change of merging point. Considering the temperature correlation, such sensitivity is calculated by a parameterized model, which is compressed by the SVD and K-means methods. The experimental results show that our algorithm reduces worst-case skew by up to 5X compared to the existing zero-skew based approach ZST/DME [2, 3] with small (up to 1%) wirelength overhead. In the future, we plan to extend our method to handle process and supply-voltage variations introduced clock skew. Clock Tree r3: 1724 nodes Figure 7: Initial clock tree (shown in black dashline), and optimized clock tree (shown in red dotline) after PECO of r3 #### 8. REFERENCES - R.-S. Tsay, "An exact zero skew clock routing algorithm," *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, pp. 242–249, 1993. - [2] T. H. Chao, Y-C.Hsu, J-M.Ho, K.D.Boese, and A. Kahng, "Zero skew clock routing with minimum wirelength," *IEEE Trans. Circuits Syst. II*, pp. 799–814, 1992. - [3] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao, "Bounded-skew clock and Steiner routing," ACM Trans. Design Automation of Electronics Systems, 1997. to appear. - [4] M. Cho, S. Ahmed, and D. Z. Pan, "TACO: Temperature aware clock-tree optimization," in Proc. Intl. Conf. Computer-Aided Design, 2005. - [5] J. Tsai, L. Zhang, and C. C. Chen, "Statistical timing analysis driven post-silicon-tunable clock-tree synthesis," in *Proc. Intl. Conf. Computer-Aided Design*, 2005. - [6] U. Padmanabhan, J. M. Wang, and J. Hu, "Statistical clock tree routing for robustness to process variations," in *Proc. Intl. Symp. Physical Design*, 2006. - [7] C. C. Teng, Y. K. Cheng, E. Rosenbaum, and S. M. Kang, "iTEM: A temperature-dependent electromigration reliability diagnosis tool," *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, pp. 882–893, 1997. - [8] K.Banerjee, A.Mehrotra, A.Sangiovanni-Vincentelli, and C. Hu, "On thermal effects in deep sub-micron VLSI interconnects," in *Proc. Design Automation Conf.*, 1999. - [9] T.-Y. Chiang, B. Shieh, and K. Sarawat, "Impact of Joule heating on scaling of deep sub-micron Cu/low-k interconnects," in *Symposium on VLSI Technology*, 2002. - [10] A. H. Ajami, K. Banerjee, and M. Pedram, "Modeling and analysis of nonuniform substrate temperature effects on global ULSI interconnects," *IEEE Trans.* Computer-Aided Design of Integrated Circuits and Systems, pp. 849–861, 2005. - [11] K. Skadron, M. Stan, and et. al., "Temperature-aware microarchitecture," in Proc. Intl. Symp. on Computer Architecture, 2006. - [12] W. Liao, L. He, and K. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," *IEEE Trans.* Computer-Aided Design of Integrated Circuits and Systems, pp. 1042–1053, 2005. - [13] X. Li, P. Li, and L. Pileggi, "Parameterized interconnect order reduction with explicit-and-implicit multi-parameter moment matching for inter/intra-die variations," in *Proc. Intl. Conf. Computer-Aided Design*, 2005. - [14] G. H. Golub and C. F. V. Loan, Matrix Computations. Baltimore, MD: The Johns Hopkins University Press, 3 ed., 1989. - [15] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in *Proceedings* of 5-th Berkeley Symposium on Mathematical Statistics and Probability, 1967. - [16] A. Odabasioglu, M. Celik, and L. Pileggi, "PRIMA: Passive reduced-order interconnect macro-modeling algorithm," *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, pp. 645–654, 1998. - [17] H. Yu, L. He, and S. Tan, "Block structure preserving model reduction," in *IEEE International Workshop on Behavioral Modeling and Simulation (BMAS)*, 2005. - [18] H. Yu, Y. Shi, and L. He, "Fast analysis of structured power grid by triangularization based structure preserving model order reduction," in *Proc. Design Automation Conf.*, 2006.