

Contents lists available at SciVerse ScienceDirect

# INTEGRATION, the VLSI journal



journal homepage: www.elsevier.com/locate/vlsi

# SPECO: Stochastic Perturbation based Clock tree Optimization considering temperature uncertainty $\stackrel{\scriptscriptstyle \rm tree}{\sim}$

Sina Basir-Kazeruni<sup>a</sup>, Hao Yu<sup>b,\*</sup>, Fang Gong<sup>a</sup>, Yu Hu<sup>a</sup>, Chunchen Liu<sup>a</sup>, Lei He<sup>a</sup>

<sup>a</sup> Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095, USA
 <sup>b</sup> School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639669, Singapore

## A R T I C L E I N F O Available online 6 May 2012

Robust clock-tree synthesis

Stochastic clock-skew optimization

Temperature uncertainty

Keywords:

ABSTRACT

Modern computing system applications or workloads can bring significant non-uniform temperature gradient on-chip, and hence can cause significant temperature uncertainty during clock-tree synthesis. Existing designs of clock-trees have to assume a given time-invariant worst-case temperature map but cannot deal with a set of temperature maps under a set of workloads. For robust clock-tree synthesis considering temperature uncertainty, this paper presents a new problem formulation: Stochastic PErturbation based Clock Optimization (SPECO). In SPECO algorithm, one nominal clock-tree is presynthesized with determined merging points. The impact from the stochastic temperature variation is modeled by perturbation (or small physical displacement) of merging points to offset the induced skews. Because the implementation cost is reduced but the design complexity is increased, the determination of optimal positions of perturbed merging points requires a computationally efficient algorithm.

In this paper, one Non-Monte-Carlo (NMC) method is deployed to generate skew and skew variance by one-time analysis when a set of stochastic temperature maps is already provided. Moreover, one principal temperature-map analysis is developed to reduce the design complexity by clustering correlated merging points based on the subspace of the correlation matrix. As a result, the new merging points can be efficiently determined level by level with both skew and its variance reduced. The experimental results show that our SPECO algorithm can effectively reduce the clock-skew and its variance under a number of workloads with minimized wire-length overhead and computational cost. © 2012 Elsevier B.V. All rights reserved.

### 1. Introduction

Clock-tree synthesis with timing verification is always one crucial design step for high-performance VLSI circuits [1,2]. The deployment of technology scaling has resulted in a large nonuniform power dissipation, which further causes a non-uniform temperature gradient, i.e., spatial temperature variation over the entire chip. [3–9]. As clock net is routed over the chip by global interconnect, the temperature gradient can bring a significant skew variation and hence can cause functionary failure. [10,5,11–13]. As such, the traditional clock-tree synthesis methods [14–16] without considering temperature impact would become inaccurate.

Given a set of sinks (flip-flops), the clock-tree synthesis is to find a topology and embedding with the minimized wire length and mismatch of arrival times, called *skew*. Wire-length

\* Corresponding author.

minimizations under a zero or bounded skew constraint were developed in [14–16]. Those balanced skew points for embedding are called merging points. Fig. 1 shows the general clock-tree synthesis flow. Note that the previous approaches have not considered the skew introduced from the variations of process, supply voltage, and temperature (PVT). PVT variations can shift the ideal merging points from the nominal positions.<sup>1</sup> Considering the skew caused by the temperature gradient, TACO [10] is the pioneer work that deploys re-embedding method for clock-tree synthesis. When the worst-case temperature map is given under one specific workload, one can minimize the worst-case skew by searching in a merging diamond. Compared to the other approaches by adjusting buffer size or by adding linked-resistor [17-19], TACO does not introduce additional cost of power consumption and physical implementation. However, it is unknown that how TACO identifies the worst-case temperature map that can lead to the worst-case skew.

 $<sup>^{\</sup>star}$  Some preliminary results of this paper have appeared in Proceedings of the International Symposium on Physical Design (ISPD'07). The work is sponsored by Singapore MOE-TIER2 ARC 5/11 grant.

*E-mail addresses:* haoyu@ntu.edu.sg, hy255@ee.ucla.edu (H. Yu), lhe@ee.ucla.edu (L. He).

<sup>0167-9260/\$ -</sup> see front matter @ 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.vlsi.2012.04.004

<sup>&</sup>lt;sup>1</sup> Though this paper is focused on the skew minimization caused by the temperature variation, the similar procedure can be extended to consider other environmental variations.



Fig. 1. Diagrams of clock-tree synthesis of topology mapping and layout embedding.

More importantly, in order to maintain the chip temperature for device reliability, especially when more computing cores are integrated together, there is an increasing use of dynamic power management (DPM) in modern VLSI designs [20-24]. DPM can be performed in deterministic or stochastic fashion. To consider the uncertainty introduced by the real-time input of workloads (or a set of applications), DPM can be deployed as a stationary stochastic process using Markov chains [24] for an optimal decision. As such, the uncertainty can be transferred to the uncertainty of the cycleaccurate power, thermal power and temperature map. Such an observation implies that if skew is optimized based on the timeinvariant temperature map for one application, the resulting embedding is unlikely optimized and even may lead to an excessive skew for other applications. A robust clock-tree synthesis thereby needs to consider the stochastic temperature behavior with the presence of the workload uncertainty.

The primary contribution of our paper is a new problem formulation of the clock-tree synthesis with the consideration of the temperature uncertainty. Instead of using one assumed worst-case temperature map [10], we propose to consider a set of temperature maps applied in a stochastic fashion. Accordingly, we develop a Stochastic PErturbation based Clock-tree Optimization (SPECO) to adjust the perturbed merging points and reembed them to minimize the stochastic clock-skew and its variation, instead of the worst-case clock-skew. Given a set of stochastic temperature maps, the optimization flow of SPECO starts with an initially balanced clock-tree, and then minimizes the stochastic clock-skew and its variation level by level driven by the clock-skew sensitivity, with respect to the displacement of the merging point. To speed up the whole synthesis flow, fast computational algorithms can be employed. Our initial work developed in [11] has applied a structured and parameterized macromodel for the clock-tree design. This paper improves [11] from twofold. Firstly, we have formulated a true stochastic problem formulation to reduce both skew and skew variance considering temperature uncertainty, and one Non-Monte-Carlo (NMC) method [25,26] is developed to handle stochastic temperature maps. In contrast, the approach in [11] is still performed in a deterministic or corner-based fashion. Secondly, we have applied a new principal temperature-map analysis (PTA) to identify the dominant temperature-relevant parameters in a more effective and efficient fashion. For example, the dominant parameters are identified from the subspace of the correlation matrix in this paper, instead of using the correlation matrix itself [11]. This modification has shown more accurate and efficient results. The experimental results show that when compared to DME, SPECO can reduce the skew by  $6.08 \times$  with wire-length overhead 1% on average, and TACO can only reduce the skew by 3.93 × with wirelength overhead 6% on average. Moreover, only SPECO can further reduce the skew variance. Note that the computational complexity is also reduced in this new SPECO algorithm. For example, with the use of NMC. SPECO only requires one-time stochastic analysis but DME and TACO both need to search the worst-case skew from 1000 temperature maps. In addition, the new PTA in SPECO also improves the quality of the synthesized tree and reduces the runtime.

The remaining part of this paper is organized by the following manner. The problem formulation of SPECO is first described in Section 2. The stochastic calculation of clock-skew and its variation are discussed in Section 3. The according stochastic-sensitivity based clock-tree optimization with the utilization of the compactly parameterized macromodel is presented in Section 4. The experiment result is summarized in Section 5, and the paper is concluded in Section 6.

### 2. Clock tree synthesis considering stochastic temperature

In this section, we elaborate the background of temperature model and temperature uncertainty and the need of a stochastic temperature model. We then present an according problem formulation called, Stochastic Perturbation based Clock-tree Optimization (SPECO). For a clear presentation of mathematical related derivation, we first summarize the symbols and terms in Table 1.

As the focus of this paper is mainly for clock-tree synthesis with consideration of temperature, the physical impact to skew from wire is thereby the target under consideration. However, this does not mean that the temperature impact to clock-buffer cannot be considered in this framework. In fact, if one can have the clock-tree designs with determined clock-buffer size, the

**Table 1**Table for definitions and terms.

| Variable               | Physical meaning                                                    |
|------------------------|---------------------------------------------------------------------|
| $P_i/P_i^{th}$         | Electrical/thermal power at node <i>i</i>                           |
| $r_i/r_i^{th}$         | Electrical/thermal resistance at node <i>i</i>                      |
| $\delta T_i$           | Nominal of temperature-gradient at node <i>i</i>                    |
| $\delta \hat{T}_i$     | Mean of temperature-gradient at node <i>i</i>                       |
| $\sigma_i$             | Variance of temperature-gradient at node <i>i</i>                   |
| $\Sigma(i,j)$          | Correlation of temperature-gradient between node $i$ and $j$        |
| ζ                      | Random variable for stochastic temperature gradient                 |
| M/M'                   | Original/perturbed merging points at one level                      |
| $l_k(M,s_k)$           | Embedding path from $M$ to one sink $s_k$                           |
| PC                     | Perturbation configuration at one level                             |
| $S(i,j)/\delta S(i,j)$ | Skew/skew-variance between nodes <i>i</i> and <i>j</i> at one level |
| $G_0/C_0$              | Conductive/capacitive state matrix                                  |
| $G(\xi)/C(\xi)$        | Stochastic conductive/capacitive state matrix                       |
| $A(\xi)$               | Combined stochastic state matrix                                    |
| δΑ                     | Perturbed state matrix                                              |
| x/y                    | Voltage state variable/output                                       |
| B/L                    | Input/output port matrix                                            |
| $\delta G_i$           | Conductive change by <i>PC<sup>i</sup></i>                          |
| $\Phi$                 | Stochastic orthogonal polynomials                                   |
| Κ                      | Principal parameter number                                          |

temperature-dependent delay contributed from the clock-buffer can be included to optimize by the same approach.

### 2.1. Temperature model

Let us first review the electro-thermal model in this part. The overall chip is assumed to be uniformly divided into N tiles. Accordingly, for one piece of metal of clock at *i*th tile (i = 1, ..., N), its dissipated thermal power  $P_i^{th}$  is defined by averaging the cycleaccurate  $P_i$  at the thermal-time-constant scale *t*th

$$P_i^{th} = \frac{1}{t^{th}} \int P_i(t) dt.$$
(1)

Such a thermal power becomes the source to increase the surrounding temperature. According to its electrical analogy, one can have the increased temperature  $\delta T_i$  by

$$\delta T_i = r_i^{th} P_i^{tn},\tag{2}$$

where  $r_i^{th}$  is the thermal resistance at *i*th tile.

Note that the electron mobility depends on the temperature. The mobility decreases with increased temperature and hence this leads to increased resistance described below:

$$r_i(\delta T_i) = r^0 (1 + \beta \cdot \delta T_i) \tag{3}$$

where  $r^0$  is the resistance at the initial temperature for one tile, and  $\beta$  is the temperature coefficient (1/°C). The increased electrical resistance further leads to increased delay and skew for clock. As such, to understand the clock-skew induced by temperature, one needs to first trace the root to the power consumption.

### 2.2. Temperature uncertainty

Considering a set of stochastic temperature maps for clocktree synthesis is important. In contrast, optimizing clock-tree under one temperature map for one application [10] can be nonoptimal for other applications. Let us illustrate this by the following example. We assume a micro-architecture level power and temperature simulator [23] using alpha architecture with total  $\mathcal{N}$  tiles. We also assume that a set of workloads with applications (ammp, art, compress, equake, gzip, gcc) from SPEC2000 benchmark are deployed. The thermal power is defined by averaging the cycle-accurate (scale of ps) power in the thermal-constant scale (scale of *ms*). Using this thermal power as input, the steady-state temperature over the chip can be estimated at one tile in the grid. Fig. 2 shows the temperature maps for two different applications at selected time instants, where (a) is for application: ammp at 11-million cycles and is for application: gzip at 15-million cycles. Clearly, their temperatures

### b а 200 140 120 150 100 Temp(C) Femp(C) 80 100 60 40 50 10 10 10 5 0 0 0 0

**Fig. 2.** The temperature gradient over the chip after 10 million cycles. (a) is for application:ammp and (b) is under application:gzip.

are quite different and hence each can lead to a significantly different skew variation.

To maintain a robust performance for the modern VLSI system, especially the mobile system, dynamic power management (DPM) is employed to achieve energy-efficient computation by selectively scheduling the applications or called workloads during one operation period [20-24]. There are two types of DPM policies: predictive DPM scheduler and stochastic scheduler. To obtain the globally optimized solution, the stochastic DPM is considered in this paper by assuming the DPM scheduler as a stationary stochastic process. A set of workloads will be scheduled to executed stochastically. The according cycle-accurate power. thermal-power and temperature map defined in (1) and (2)thereby become stochastic. As such, one can assume one type of stationary stochastic distribution with random variable  $\xi$  for each tile temperature  $\delta T_i$  in an interval  $[\delta T_i^{min}, \delta T_i^{max}]$ , where  $\delta T_i^{min}$  and  $\delta T_i^{max}$  are determined by taking the minimum and maximum temperatures from a set of temperature maps. Note that only one random variable  $\xi$  is assumed as the stochastic DPM scheduler is applied with the same behavior to all tiles when one set of workloads are given. Moreover, this stochastic process can be assumed as stationary since the DPM scheduler takes action only when the temperature becomes stabilized at the time-scale of thermal-time-constant for one workload.

Therefore, one can use a stochastic distribution associated with a random variable  $\xi_i$  for the temperature change  $\delta T_i$  at each tile (i = 1, ..., N). Assuming that there are *X* sampled temperature maps at each tile, the stochastic temperature distribution at *i*th tile becomes

$$\delta T_i(\xi) = \hat{\delta T}_i + \sigma_i \xi, \tag{4}$$

where

$$\hat{\delta T}_{i} = \sum_{k=1}^{X} \delta T_{i}^{k} / X, \quad \sigma_{i} = \sqrt{\sum_{k=1}^{X} (\delta T_{i}^{k})^{2} / X - (\hat{\delta T}_{i})^{2}}$$
(5)

defines the mean and the variance. The random variable for stochastic temperature gradient,  $\xi$ , follows a Gaussian distribution with mean of zero, and its variance depends on different temperature variations/ perturbations. Note that though only one random variable  $\xi$  is assumed for all tiles. The stochastic distribution at each tile can be still different since both  $\delta T_i$  and  $\sigma_i$  can be different. When Gaussian distribution is further assumed, the distribution at each tile is mainly distinguished by  $\sigma_i$ . Moreover, the correlation between *i*th tile and *j*th tile can be described by

$$\Sigma(i,j) = \frac{\left(\frac{1}{X}\sum_{k=1}^{X}\delta T_{i}^{k}\delta T_{j}^{k}\right) - \left(\frac{1}{X^{2}}\sum_{k=1}^{X}\delta T_{i}^{k}\sum_{k=1}^{X}\delta T_{j}^{k}\right)}{\sigma_{i} \cdot \sigma_{j}},\tag{6}$$

which forms the correlation matrix  $\Sigma$  ( $\in \mathbb{R}^{N \times N}$ ). Based on the above definitions, we can develop a stochastic clock-skew analysis and optimization in the following part of the paper.

### 2.3. SPECO problem formulation

In this part, we discuss how to robustly embed the clock tree and reduce the clock skew introduced by the temperature variation. Similar to TACO, our starting point is a balanced-skew clocktree  $\mathcal{T}$  obtained by either differed-merge embedding (DME) or bounded-skew clock-routing (BST) method [27,16] with a uniform temperature map as the initial condition. The temperature impact is modeled by a perturbation to the merging point. As such, one can reduce the cost of power and physical resource by simply adjusting the position of one merging point. However, the assumption of one given temperature map for the worst-case skew [10] does not hold under dynamic power management. As a set of workloads can be scheduled stochastically for one given evaluation period, we expect to observe a stochastic temperature distribution  $\xi_i$  at each tile. Therefore, the adjustment of one merging point needs to reduce the clock-skew under a set of temperature maps. Our initial work in [11] assumes the use of the mean temperature to take into account on a set of temperature maps. However, clock-skew analysis and optimization under the mean temperature is still inaccurate for an optimal solution for all types of workloads. In addition, the variation (variance) of clock-skew is also not considered. This becomes the motivation for this paper to develop a stochastic perturbation based clock-tree optimization (SPECO) algorithm as follows.

We first list a few relevant foundations used in the following paper.

**Definition 1.** Given a balanced-tree with one source node  $n_{src}$ ,  $\mathcal{N}_l$  levels and  $\mathcal{N}_n$  nodes, level i  $(i = 1, ..., \mathcal{N}_l)$  is a node-set with  $1, ..., n^i$  nodes, a merging-point set with  $1, ..., n^i$  merging-points, and a path-set with  $1, ..., l^i$  embedding-paths. The embedding-path  $l(M, s_k)$  is from one node  $s_k$  ( $\forall s_k \in n^i$ ) to one merging-point M ( $\forall M \in m^i$ ). After perturbation, re-embedding path is an embedded path  $l(M', s_k)$  from M' to node  $s_k$ . Note that  $\mathcal{N}_n = \sum_{i=1}^{N_l} n^i$ .

**Definition 2.** *M* can be perturbed by a constant distance d  $(d = n \times d_0)$  along four Manhattan directions (North, South, West, East). Note that  $d_0$  is the tile width and *n* is one specified integer parameter as the step of displacement. It is also possible that the merging point *M* can happen to remain unchanged for its physical location. As such, there are totally five types of possible perturbation displacements for *M* during one iteration.

Based on [10,11], we assume that the merging point (or the balanced skew point) M of the initial tree T is locally adjusted level by level in a bottom-up fashion and the bottom one is at the 1st level. At one level (except for the bottom one), the perturbed merging-point M' from the original merging-point M of one pair of nodes  $(n_1, n_2)$  is assumed to locate in a bounded region centered at M. As such, one needs to identify the location of M', which can minimize the clock-skew and avoid the wire-length overhead. This can be achieved through a number of local refinement iterations as follows.

Moreover, as shown in Fig. 3, one needs to identify a sequence of perturbation displacements of M to reach M', which can eventually achieve a minimum clock-skew and its variation. After the determination of the merging point M', one can further decide the pair of embedding-paths l from M' to the pair of nodes  $(n_1, n_2)$ . By performing the same procedure level by level in a bottom-up



Fig. 3. Five perturbation configurations for one merging point and one perturbation displacement for a sequence of perturbed merging points.

fashion, one can determine a new clock-tree topology with newly balanced merging points considering the temperature perturbation. As a result, we can formulate our clock-tree problem as follows

| minimize   | Wirelength                                       |  |  |  |
|------------|--------------------------------------------------|--|--|--|
| subject to | $S(i,j) \le s_0$<br>$\delta S(i,j) \le s_\sigma$ |  |  |  |

**Formulation 1.** Given the source node  $n_{src}$ , the sink nodes  $s_1 \cdots s_n$  at bottom level, initial topology of one clock-tree (with embedding) by DME/BST, and a number of temperature maps under different applications, the temperature  $\delta T_i$  at ith tile is assumed to have a stochastic distribution described by random variable  $\xi$  ( $\delta \hat{T}_i, \sigma_i$ ). Our Stochastic Perturbation based Clock-tree (SPECO) algorithm is to find a re-embedded clock-tree to minimize the additional perturbation displacement (or Wirelength) of merging points at one level such that the clock-skew S(i,j) and its variance  $\delta S(i,j)$  between nodes i and j satisfy the budget  $s_0$  and  $s_{\sigma}$ , respectively.

This problem formulation considers the stochastic temperature maps but not an averaged temperature map, and is hence different from our previous work in [11]. The overall flow to solve this problem is briefly explained as follows. Our SPECO algorithm is performed level by level in a bottom-up fashion. Fig. 3 only demonstrates the procedure for one merging-point to find the location of the finalized merging point M' and its associated new embedding path.

### 2.4. Challenges to solve SPECO problem

Note that each perturbation displacement has five candidates, and there are  $5^{n^i}$  possible perturbation combinations of displacements if there are  $n^i$  merging points at *i*th level. This paper defines the design parameter as every possible perturbation combination (PC) for merging-points at one level. So during one optimization step, one PC is selected if the resulted output change, or sensitivity, can lead to the minimum skew and skew variance. The merging-point continues to move and the PC selection continues to proceed till the budgets of the skew and skew variation at one level are both satisfied. After identifying the PC at the current level, the solution is propagated to the higher level and the same optimization process is repeated.

However, there are three primary difficulties encountered here. Firstly, an efficient stochastic analysis is needed to calculate both clock-skew and its variation under the temperature uncertainty. The Monte-Carlo simulation needs to be performed many times repeatedly and hence is slow. Hence non-Mote-Carlo simulation is needed to perform the analysis just once. Secondly, one needs to identify a small number of 'representative' perturbation combinations. Blindly trying each PC is computationally impossible. One needs to classify parameters into a few 'representative' clusters for this purpose. Lastly, the determination of the sequence of the perturbation combination needs to follow a track driven by the sensitivity.

Note that the key to solve the SPECO problem depends on an efficient and accurate calculation of clock-skew and sensitivity. In the following, we first present an efficient Non-Monte-Carlo variational analysis of clock-skew.

### 3. Calculation of stochastic clock-skew and variation

### 3.1. State equation of clock tree

To calculate the clock-skew and its variation, we need to first build an electrical circuit model for the clock-tree. Recall that the chip is uniformly divided into N tiles in this paper. As such, each

segmented clock wire in one title is represented by one *RC*  $\pi$ -element, with two grounded capacitors at two nodes and one resistor at one edge. Note that the node capacitor includes both the wire and loading capacitance.

Given the initial clock-tree topology T with  $N_l$  levels and  $N_n$  nodes, its electrical state equation can be described by Modified Nodal Analysis (MNA) in time-domain as

$$G_0 x(t) + C_0 \frac{dx(t)}{dt} = Bu(t),$$
  
$$y(t) = L^T x(t),$$
 (7)

or in frequency-domain (s) as

. ..

 $(G_0 + sC_0)x(s) = Bu(s),$ 

$$y(s) = L^T x(s), \tag{8}$$

where  $G_0$  and  $C_0$  ( $\in \mathbb{R}^{N_n \times N_n}$ ) are the conductive and capacitive matrices for the initial clock-tree  $\mathcal{T}$ , x is the state variable for node voltage, y is the output nodes to observe, and B and L are topological matrices describing how to connect inputs and outputs to the network, or called input and output port matrices.

Moreover, the total  $N_n$  nodes here include: one source nodes, w sink nodes and z Steiner nodes. As such, the vector of state variable x can be represented by

$$x = [x_{s_1}, \dots, x_{s_w}, x_{st_1}, \dots, x_{st_z}, x_{src}]^l,$$
(9)

where  $s_i$  (i = 1, ..., w) is the sink node and  $st_i$  (i = 1, ..., z) is the Steiner node.

Furthermore, since the input signal is applied only to the source node in the clock-tree, the input port matrix B ( $\in \mathbb{R}^{N_n \times 1}$ ) becomes

$$B = \left[\underbrace{0, \dots, 0}_{w}, \underbrace{0, \dots, 0}_{z}, 1\right]^{T},$$
(10)

which connects the input source u(s) to the clock-tree. In this paper, we assume an impulse input at the source node.

In addition, since the voltage responses at sink nodes are desired to observe, the output port matrix  $L (\in R^{N_n \times w})$  becomes

$$L^{I} = [\mathbf{I} \ \mathbf{0}], \tag{11}$$

where **I** is  $\in R^{w \times w}$ , and **0** is  $\in R^{(\mathcal{N}_n - w) \times w}$ . The accordingly selected output *y* becomes

$$y = [y_{s_1}, \dots, y_{s_w}].$$
 (12)

Such a state equation for the clock-tree is in fact a so-called SIMO (single-input-multi-output) system.

Note that the clock-tree resistance observed at one node  $s_k$  (state-variable  $x_k$ ) ( $k \in 1, ..., N_n$ ) is given by a summation along one embedding path  $l_k(M, s_k)$  from the merging point M to the node  $s_k$ 

$$R_k(M, s_k, T) = \sum_{\forall i \in I_k(M, s_k)} r_i(\delta T_i),$$
(13)

where  $r_i(\delta T_i)$  is the temperature-affected electrical resistance at *i*th tile.

As a result, solving the output  $y(s_i)$  from the MNA at one sink  $s_i$ , the propagation delay  $D_{(src \rightarrow s_i)}$ , from the source node  $n_{src}$  to one of sink  $s_i$  (i = 1, ..., w), is the time required for the node voltage at the output  $y(s_i)$  to pass 65% of the peak voltage under the impulse input at source node. As such, one can calculate the actual clock-skew between two sinks  $s_i$  and  $s_i$  by

$$S(i,j) = D_{(src \to s_i)} - D_{(src \to s_j)}.$$
(14)

In the following, we further discuss how to calculate the impact to S(i,j) from the stochastic temperature perturbation.

### 3.2. Stochastic temperature perturbation

When one piece of clock wire at *i*th tile experiences a temperature gradient, its electrical resistance r(T) changes according to (3). To stamp this change in (7) or (8), the conductance form of (3) is given by

$$g_i(T) = \frac{g^0}{(1 + \beta \cdot \delta T_i)} \approx g^0 \cdot (1 - \beta \cdot \delta T_i), \tag{15}$$

where  $g^0 = 1/r^0$ . Note that the temperature impact  $(\beta \cdot \Delta T_i)$  to the electrical resistance is assumed as a small perturbation  $(\beta \cdot \delta T_i \ll 1)$ .

Next, we show how to calculate the resistance observed at one node  $s_k$  (state-variable  $x_k$ ) ( $k = 1, ..., N_n$ ), which is dependent on the embedding path  $l(M', s_k)$  for one perturbed merging point M' and node  $s_k$ . When M' is determined through a sequence of optimizations, the according embedding path  $l_k$  is also given with a length, i.e., the  $n_k$  selected tiles in the path. As such, one can calculate the temperature-perturbed resistance at one node  $s_k$  by

$$R_{k}(M', s_{k}, T) = \sum_{i=1}^{n_{k}} r_{i}(\delta T_{i}) = R_{k}^{0}(1 + \beta \delta T_{k}),$$

$$R_{k}^{0} = n_{k}r^{0}, \quad \delta T_{k} = \frac{\sum_{i=1}^{n_{k}} \delta T_{i}}{n_{k}}.$$
(16)

Note that if each  $(\beta \delta T_i)$  is small, so does the averaged  $(\beta \delta T_k)$ , which is called *embedding temperature* in this paper. *As such, the conductance is given by* 

$$G_k(T) = G_k^0 \cdot (1 - \beta \delta T_k), \tag{17}$$

where  $G_k^0 = 1/R_k^0$ . Therefore, one can build an explicit relation between conductance  $G_k(T)$  and temperature gradient  $\delta T_k$ .

Recall that in our problem formulation,  $\delta T$  is described by a random variable  $\xi$  with a stochastic distribution in the interval  $[\delta T^{min}, \delta T^{max}]$ . Therefore, the state equation in (8) considering the stochastic temperature perturbation ( $\xi$ ) becomes

$$[G(\zeta) + sC]x(s,\zeta) = Bu(s), \tag{18}$$

$$A(\xi, s)x(\xi, s) = b(s). \tag{19}$$

Clearly, the analysis of  $\xi$  by repeatedly running Monte-Carlo is computationally expensive. A Non-Monte-Carlo stochastic analysis is thereby introduced in the next part.

### 3.3. Non-Monte-Carlo stochastic analysis

or

In this part, we show a Non-Monte-Carlo stochastic analysis by expanding the stochastic state matrix  $A(\xi)$  and state variable  $x(\xi)$  with stochastic orthogonal polynomials (SOPs) [25,26]. This assumes that the random distribution  $\xi$  is related to one stochastic orthogonal polynomial  $\Phi(\xi)$ . For example, for a Gaussian random distribution,  $\Phi_i(\xi)$  is a Hermite polynomial

$$\Phi(\xi) = [1, \, \xi, \, \xi^2 - 1, \dots,]^T.$$
<sup>(20)</sup>

As such,  $A(\xi)$  can be expanded by *nth* order Hermite polynomials

$$A(\xi) = A_0 \Phi_0(\xi) + A_1 \Phi_1(\xi) + \dots + A_n \Phi_n(\xi) = \sum_{k=0}^n A_k \Phi_k(\xi).$$
 (21)

Accordingly, the state variable  $x(\xi)$  can be also expanded by

$$\mathbf{x}(\boldsymbol{\xi}) = \sum_{k=0}^{n} \mathbf{x}_{k} \boldsymbol{\Phi}_{k}(\boldsymbol{\xi}).$$
(22)

One can obtain the mean and the variance of  $x(\zeta)$  from  $E(x(\zeta)) = x_0$ ,

 $Var(x(\xi)) = x_1^2 Var(\xi) + x_2^2 Var(\xi^2 - 1),$ 

This needs to solve  $x_0$ ,  $x_1$  and  $x_2$ , which can be achieved by the point-collocation below.

Assuming a first-order n=1 expansion of  $A(\xi)$  and  $x(\xi)$ , and applying an inner-product with  $\Phi_I(\xi)$  to minimize the residue error

$$\langle \Phi_k, A(\xi)x(\xi) - b \rangle = 0,$$
 (23)

one can obtain an augmented state equation

$$\begin{pmatrix} A_0 & A_1 & 0\\ A_1 & A_0 & 2A_1\\ 0 & 2A_1 & A_0 \end{pmatrix} \times \begin{pmatrix} x_0\\ x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} b\\ 0\\ 0 \end{pmatrix},$$
(24)

to solve  $x_0$ ,  $x_1$  and  $x_2$  just by one-time simulation. Based on  $E(x(\xi))$  and  $Var(x(\xi))$ , one can further calculate the mean (zero-mean for Gaussian) and the variance of the clock-skew  $\delta S(i,j)$  in addition to the nominal clock-skew S(i,j). Afterwards, the maximum skew and its variance at one node is searched through the entire tree as the performance metric.

### 4. Optimization of stochastic clock-skew and variation

In Section 3, we have presented a Non-Monte-Carlo (NMC) clock-skew analysis to consider the temperature uncertainty. To offset the clock-skew caused by the temperature-induced electrical conductance variation, similar to [10,11], we adjust the physical location of merging points as our design freedom. This paper defines the design parameter as every possible perturbation combination (PC) of merging-points at one level. One sequence of PCs are selected till the clock-skew and its variance of one merging point satisfy the budgets at that level. The same process is repeated for the next level with the minimized total displacement of merging points. Such an optimization is driven by the updated sensitivity for each PC, which requires to first parameterize PCs in the stochastic state equation. Moreover, the number of PCs to consider needs to be pruned by only selecting the representative or dominant PCs.

### 4.1. Parameterization of stochastic state equation

Considering the temperature impact is small, one can model the displacement of the merging-point location as a small perturbation to the stochastic state equation in (19). As a result, the corresponding change in the state matrix A is modeled by  $\delta A_j^i$ , given the *j*th perturbation combination (PC) (j = 1, ..., J,  $J = 5^{m_i}$ ) on *i*th level ( $i = 1, ..., N_l$ ). Recall that one merging point can have five possible modified locations and hence  $m^i$  merging points can have  $5^{m^i}$  combinations. Each perturbation combination is treated as a parameter.

Note that  $\delta A_j^i$  is mainly from the change of the electrical conductance when modifying one re-embedding path. As such, based on (16), if the length of the original embedding path  $l_k$  with the embedding temperature  $\delta T_k$  is changed to the new embedding path  $l'_k$  with the new embedding temperature  $\delta T'_k$ , one can have

$$\delta A = \delta G = (\beta G_k^0) \cdot (\delta T_k - \delta T'_k).$$

As such, the perturbed stochastic state equation under all perturbation combinations (PCs) becomes

$$(A + \delta A_1) \cdot (x + \delta x_1) = b,$$
  

$$(A + \delta A_2) \cdot (x + \delta x_2) = b,$$
  

$$\dots$$
  

$$(A + \delta A_J) \cdot (x + \delta x_J) = b.$$
  
(25)

By organizing the expanded terms in the order of perturbations, and defining a new state variable

$$\boldsymbol{x}_P = [\boldsymbol{x}, \delta \boldsymbol{x}_1, \dots, \delta \boldsymbol{x}_J]^T,$$
(26)

one can obtain a parameterized system equation [11]

$$A_P x_P = b_P, (27)$$

with augmented state matrix below

$$A_{P} = \begin{bmatrix} A & 0 & \dots & 0 & 0 & 0 \\ \delta A_{1} & A & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ \delta A_{j} & 0 & \dots & A & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \delta A_{j} & 0 & 0 & \dots & 0 & A \end{bmatrix},$$
(28)

$$b_P = \left[\underbrace{b, \dots, b}_{J}\right]^T.$$
(29)

In addition, the output is

$$\boldsymbol{y}_{P} = \boldsymbol{L}_{P}^{T} \boldsymbol{x}_{P} = [\boldsymbol{y}, \delta \boldsymbol{y}_{1}, \dots, \delta \boldsymbol{y}_{J}]^{T},$$
(30)

where  $L_p^T$  is the augmented output port matrix, and each  $\delta y_j$  (j = 1, ..., J) is a vector, representing the voltage response and changes for each sink.

As shown in Section 3, by solving the above system Eq. (27) together with Eq. (24) in the time-domain using Backward Euler method [11], the clock-skew *S* and its variation  $\delta S$  at each level can be calculated from the voltage response y(t). Note that as the augmented system is lower-block-triangular. Considering this and that the diagonal block is the same *A*, there is only one computational cost of factoring *A* defined in (24). As a result, the total voltage response in each sink under the perturbation combination  $PC_i^i$  at level i ( $i = 1, ..., \mathcal{N}_i$ , j = 1, ..., J) is

$$y_{pc^{i}}(t) = y^{i}(t) + \delta y^{i}_{i}(t).$$
 (31)

The solving is performed in a bottom-up fashion by solving  $y^1, y^2, \ldots, y^{N_l}$  level by level sequentially.

### 4.2. Compression of perturbation configurations

To optimize the clock-tree under temperature variation, one needs to determine the best positions of merging-points at each level. Clearly, it is computationally expensive if not impossible to explore each possible combination of PCs by solving (27). For example, a typical level in clock-tree contains 200 merging points. If each merging point has five potential perturbations (move along North, South, East, West, Origin), there could be  $J = 5^{200}$  parameterized PCs. As observed in [22,23,11,24], logic sharing can happen frequently in the general-purpose computing system. Though there can be many different tasks or workloads applied, many different tasks will access the computing units (L2 Cache) in a similar pattern. In addition, for the domainspecific computing system, there are always repeated workloads with predictable patterns. As a result, there can exist strong temperature correlation at different regions. When many tiles show the similar temperature distributions, it in turn leads to the similar perturbation for a group of merging points, and merging points in one group can move similarly together along North, South, East, West, or Origin.

This observation inspires us to compress the complexity of parameters (PCs) by studying the temperature correlation. Such a dimension reduction is similar to the conventional classifier problem in pattern analysis [28]. Learning can be applied to identify the eigen-function or classifier from a set of data, because the intrinsic dimensionality of the data sometimes is much smaller than the number of parameters used to describe it. The most popular dimensionality reduction is based on the subspace learning, such as Principal Component Analysis (PCA), which projects the data along the directions where the data varies the most. These directions are generally determined by the eigenvectors of the covariance matrix corresponding to the largest eigenvalues, whose eigenvalues corresponds to the variance of the data along the eigenvector directions. As such, in order to develop an efficient vet accurate clock-tree synthesis, one needs to build a compactly parameterized model with the use of the classifier. which can prune the insignificant parameters by classifying the temperature data. Utilizing the classified temperature data, one can further identify a number of clusters of parameters with a dimension much smaller than the original as follows.

At one level of the clock-tree, there are *m* tiles containing *m* merging points with *m* temperature distributions  $\delta T_1, \ldots, \delta T_m$ . Here, each  $\delta T_i$  is a vector ( $\in \mathbb{R}^{X \times 1}$ ) of sampled temperature data at *i*th tile under different workloads with the power management. It is computationally expensive for any clustering algorithm directly applied on the original temperature data with a large size. Therefore, one needs to find a subset from the original temperature data can be pruned.

Given a set of pre-measured temperature maps, the temperature correlation matrix  $\Sigma$  in (6) can be extracted for merging points at one level. Such a correlation matrix usually can have a number of large-valued entries with the similar magnitude, which generally results in a low-rank matrix. Physically, this indicates that the temperatures of a number of tiles can change in the similar fashion. The similar change in temperatures can lead to the similar displacements of those merging points. Based on this observation, our previous approach in [11] identifies the clusters of PCs by analyzing the temperature correlation matrix (6) via K-means clustering. K-means clustering was initially introduced by [29] in VLSI macromodeling such as the classification of the dominant ports. The computational cost in [29,11] is high, especially at the bottom level. The reason is that the unsupervised K-means clustering is naively employed to the overall correlation matrix  $\Sigma$ , which has a quite large dimension in general. In the following, we have developed a new classifying method to find the dominant PCs with supervision.

- Step-1: Extract the correlation matrix  $\Sigma$  ( $\in \mathbb{R}^{m \times m}$ ) based on (6);
- Step-2: Identify the subspace spanned by  $K (K \leq m)$  columns  $u_1, \ldots, u_K$  from  $U (U \in R^{m \times K})$ , which is obtained by singular-value-decomposition (SVD):  $\Sigma = U \times S \times V (S \in R^{K \times K}, V \in R^{K \times K})$ , where the rank K is obtained when the first K singular values in S are larger than one specified threshold  $\epsilon$ .
- Step-3: Form *m* rows u'<sub>1</sub>,...,u'<sub>m</sub> from *U*, apply *K*-means clustering to u'<sub>1</sub>,...,u'<sub>m</sub>, and find *K* clusters with *K* centroid rows u''<sub>1</sub>,...,u''<sub>K</sub>;
- Step-4: Cluster *m* rows of merging-point temperatures  $\delta T_1, \ldots, \delta T_m$  to *K* clusters with *K* centroid rows of merging-point temperatures  $\delta T_1, \ldots, \delta T_K$ ; so does the clustering of *m* tiles of merging-points to *K* clusters.

The above approach is called principal temperature-map analysis. Different from [11], the *K*-means clustering is applied to the subspace of the correlation matrix. The subspace are *m* much shorter singular vectors, which have already pruned the unimportant data by SVD analysis and hence have a much lower computational cost. Moreover, the use of the above principal temperature-map analysis results in *K* dominant clusters of merging points. As such, instead of trying all  $J = 5^m$  PCs, one only

needs to try  $J' = 5^K$  PCs, where *K* is the order that is much smaller than *m*. Therefore, the calculation of clock-skew and its variance has a much lower complexity, and hence can be further integrated inside the clock-tree optimization discussed below.

### 4.3. Summary of SPECO flow

In summary, the flow of the overall SPECO algorithm is presented in Algorithm 1. After a DME-initialized clock tree construction, the re-embedding by perturbation is determined in a bottom-up fashion level by level. At each level, the merging points are first perturbed with a displacement  $d_0$ . There are five possible displacement directions to be determined. To reduce complexity, *m* merging points are first clustered into *K* clusters based on the principal temperature–map analysis. Then the compressed parameterized state equation is solved and *PC<sup>i</sup>* is decided for whom could lead to the minimum clock-skew and variance. The displacement is continued till the clock-skew budget is satisfied. After obtaining the re-embedding path at level *i*, the procedure is repeated to the upper level.

### Algorithm 1. SPECO Algorithm.

**Input**: source, sinks, initial tree T and correlation  $\Sigma$ **Output**: A re-embedded clock-tree T'

1:  $(N_l \text{ Levels}) \leftarrow \text{Levelize } \mathcal{T}$ 

{Bottom up embedding from the second last level to the second level}

- 2: **for**  $i = N_l 1$  to 1 **do**
- 3: **while**  $S \ge s_0$  and  $\delta S \ge s_\sigma$  **do**
- 4: Assume a displacement of  $d_0$  for *m* merging points
- 5: Find *K* clusters from *m* merging points
- 6: Form Parameterized (27) with  $J' PC'_{js}$
- 7:  $x_{s_k,s_l}$ ,  $\forall sinks \leftarrow$  Solve clock-skew S(k,l) and its variance  $\delta S(k,l)$  based on (24) and (28)
- 8: Decide *PC*<sup>*i*</sup>s for *K* clusters
- 9: end while
- 10: Embed in level *i*
- 11: end for
- 12: Find the maximum skew and its variance
- 13: return T'

### 5. Numerical experiment results

The proposed SPECO algorithm is implemented in C++ and Matlab. The experimental data is measured on a Linux server with 1.9 GHz CPU and 2Gb memory. the DME method in [27] and TACO method in [10] are also implemented for the comparison. The clock-tree r1-r5 in [16] are used as the benchmark. The initial tree is constructed by the DME method.

A chip with size 6 cm<sup>2</sup> is divided into a uniform grid with  $10 \times 10$  tiles to sample the non-uniformly distributed temperature map. The temperature maps at nodes of the grid are obtained from a microarchitecture level power and temperature transient simulator [23] by applying six SPEC2000 applications (*art, ammp, compress, equake, gcc, gzip*). The six applications are assumed applied in one sequence and their order in the sequence is determined by one stochastic DPM scheduler. Under this stochastic DPM, 1000 random temperature maps are generated with a sampling rate by 10 million clock cycles. The range of the temperature variation at one tile is observed about 50 °C. The macromodel (4th-order) developed in [11] is used to generate both the transient voltage response, and then to calculate the clock-skew and its variance. The maximum skew and its variance are searched through the whole tree and are recorded for the comparison. In addition, all simulations assume the unit resistance  $r_0 = 0.03 \ \Omega/\mu m$ , unit capacitance  $c_0 = 2.0 \times 10^{-17} F/\mu m$ and  $\beta = 0.0068$  [10,11], where  $\beta$  is the temperature coefficient of resistance.

# Correlation Map

Fig. 4. The spatial correlation map under a given sequential applications.



**Fig. 5.** Singular-value distribution of the correlation matrix. *x*-axis is the order of the singular values.

### 5.1. Principal temperature-map analysis

Given the 1000 sequences of power traces generated by the stochastic DPM for six applications, one can obtain a set of 1000 stochastic temperature maps. The stochastic temperature distribution at each tile is extracted based on (4)-(6). Fig. 4 shows the spatial distribution of one calculated correlation matrix. The result shows that the correlation is strong as the average correlation strength is about 0.8, which indicates the clustering based on the correlation would be effective. As such, instead of using one fixed and assumed temperature map in [10], our SPECO considers a set of stochastic temperature maps, which include the mean and variance for each chip region, and the spatial correlation between variances for different regions.

Then, the principal temperature-map analysis (PTA) is applied to compress the redundant parameterized perturbations, and is further compared with the direct *K*-means (Kmean) method used in [11]. SVD is first applied to explore the rank of correlation matrix  $\Sigma$ . Fig. 5 shows a distribution of singular values. Clearly, the dominant singular values of  $\Sigma$  decay sharply in a deceasing order. Thus, a low-rank representation of perturbed merging

### Table 3

The wirelength comparison for DME, TACO, SPECO-1 (Kmean) and SPECO-2(PTA) on benchmarks r1 to r5.

| Ckt (sink#) | Wirelength (µm) |                |            |            |  |  |
|-------------|-----------------|----------------|------------|------------|--|--|
|             | DME             | TACO           | SPECO-1    | SPECO-2    |  |  |
|             | (avg. of 1000)  | (avg. of 1000) | (one time) | (one time) |  |  |
| r1(267)     | 1.30e06         | 1.42e06        | 1.34e06    | 1.30e06    |  |  |
| r2(598)     | 2.59e06         | 2.63e06        | 2.60e06    | 2.58e06    |  |  |
| r3(862)     | 3.37e06         | 3.50e06        | 3.39e06    | 3.38e06    |  |  |
| r4(903)     | 6.81e06         | 6.95e06        | 6.85e06    | 6.82e06    |  |  |
| r5(3101)    | 1.01e07         | 1.12e07        | 1.05e07    | 1.02e07    |  |  |
| Avg.        | 1.0             | 6%             | 2%         | 1%         |  |  |

### Table 4

Runtime comparison between the Monte-Carlo method with each run using DME/ TACO; and the Non-Monte-Carlo method using one-time SoP based SPECO on benchmarks r1 to r5.

| Ckt                              | Runtime (s)       |                   |                   |                        |                    |                   |
|----------------------------------|-------------------|-------------------|-------------------|------------------------|--------------------|-------------------|
| (SIIIK#)                         | Kmean PTA         |                   | DME (avg.         | TACO (avg.<br>of 1000) | SPECO (one-time)   |                   |
|                                  |                   |                   | 01 1000)          |                        | Macromodel         | Optimization      |
| r1 (267)<br>r2 (598)<br>r3 (862) | 0.2<br>0.9<br>2.3 | 0.1<br>0.5<br>1 0 | 0.5<br>1.0<br>1.2 | 1.9<br>10.4<br>28 2    | 1.1<br>6.9<br>16.9 | 0.2<br>0.6<br>1 3 |
| r4 (903)<br>r5 (3101)            | 2.6<br>12.8       | 1.1<br>5.7        | 4.1<br>7.2        | 90.3<br>256.2          | 43.0<br>164.4      | 1.1<br>1.4        |
| Avg.                             | 2.2 ×             | 1.0               | $144 \times$      | $1596 \times$          | 1                  | .0                |

### Table 2

The skew and skew variance comparison for DME, TACO, SPECO-1 (Kmean) and SPECO-2(PTA) on benchmarks r1 to r5.

| Ckt (sink#)                                          | DME (avg. of 1000)                      |                                    | TACO (avg. of                           | TACO (avg. of 1000)               |                                          | SPECO-1 (one time)                  |                             | SPECO-2 (one time)         |  |
|------------------------------------------------------|-----------------------------------------|------------------------------------|-----------------------------------------|-----------------------------------|------------------------------------------|-------------------------------------|-----------------------------|----------------------------|--|
|                                                      | Skew(ps)                                | Var(ps)                            | Skew(ps)                                | Var(ps)                           | Skew(ps)                                 | Var(ps)                             | Skew(ps)                    | Var(ps)                    |  |
| r1(267)<br>r2(598)<br>r3(862)<br>r4(903)<br>r5(3101) | 96.6<br>358.5<br>393.3<br>952.7<br>2359 | 10.2<br>42.3<br>51.4<br>105<br>263 | 88.4<br>226.4<br>219.1<br>669.2<br>1258 | 10.5<br>40.1<br>49.5<br>98<br>256 | 71.8<br>108.5<br>134.2<br>190.3<br>923.7 | 4.6<br>13.1<br>15.7<br>21.3<br>95.3 | 37<br>76<br>82<br>88<br>315 | 3<br>7.5<br>8.5<br>7<br>54 |  |
| Avg.                                                 | 6.08 ×                                  | $6.99 \times$                      | 3.93 ×                                  | 6.68 ×                            | 2.1 ×                                    | 1.98 ×                              | 1.0                         | 1.0                        |  |

points by clustering could yield a good approximation. In this case, a rank-3 approximation is selected to generate 3 clusters to consider  $5^3$  (125) perturbation configurations (PCs).

The approach in [11] deploys the Kmean clustering directly to the correlation matrix, given the rank of the approximation. This approach is inaccurate and expensive when the number of merging points to be clustered is large. Instead, the PTA developed in this paper applies the pattern recognition in a supervised fashion [28] by pruning. In short, as discussed in Section 4.2, the subspace *U* of  $\Sigma$ , a by-product obtained for SVD, is used for the clustering study instead of  $\Sigma$  itself. When compared to SPECO-1 with Kmean by SPECO-2 with PTA, Table 2 shows that SPECO-2 with PTA reduces the skew by 2.1 × and reduces the skew variance by 1.98 × . Table 3 shows that



Clock Tree r3: 1724 nodes

**Fig. 6.** Initial clock tree (shown in black dash-line), and optimized clock tree (shown in red dot-line) after SPECO of r3. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

SPECO-2 reduces the wire-length by  $2 \times$ . Table 4 further shows that SPECO-2 reduces the runtime by  $2.2 \times$ .

### 5.2. Comparison with DME and TACO

We further compare the performance of synthesized clock-tress by DME, TACO and SPECO, respectively. For both DME and TACO: (1) Elmore delay is used for skew calculation; and (2) skew and wirelength are the averaged results under 1000 temperature maps by Monte-Carlo (MC) method. In contrast, SPECO uses (1) more accurate 4th-order macromodel; and (2) only one-time Non-Monte-Carlo (NMC) method. Moreover, the maximum skew is identified by searching the skew at all nodes. First, Fig. 6 compares the topologies of the initial clock-tree (shown in black dash-line) and the reembedded one (shown in red dot-line) for r3 benchmark. Fig. 7 further compares distributions of skew and variance for r1-r4 benchmarks before and after the SPECO optimization at one node with inputs of 1000 temperature maps. Clearly, with the SPECO optimization both skew and skew variance in the four cases are reduced.

Next, Table 2 compares the skew and the skew variance of the synthesized clock-tree by DME, TACO and SPECO. Here, the results of SPECO-II are used as the base. Clearly, the averaged skew and skew-variance of the clock-tree by DME are  $6.08 \times$  and  $6.99 \times$  larger than SPECO, and those by TACO are  $3.93 \times$  and  $6.68 \times$  larger than SPECO. Moreover, Table 2 further compares the wire-length. Because of the re-embedding, the overall wire-length of TACO and SPECO are both larger than DME. Note that as the perturbation is bounded, the overhead on wire-length is 6% by TACO, 2% by SPECO-1 and 1% by SPECO-2 on average.

The runtime comparison is also reported in Table 4. The runtime here for SPECO includes the time to build macromodel and the one to find optimized merging point. Though for one synthesis under one temperature map the overall runtime of SPECO is lager than both DME and TACO, the total runtime of SPECO for 1000 maps is still much smaller than DME and TACO. This is because only one-



Fig. 7. Comparison of skew distributions before and after SPECO under all 1000 sequences (or temperature maps) at one node of bottom level for r1-r4 benchmarks.

time Non-Monte-Carlo analysis is performed by SPECO. As such,  $144 \times$  and  $1596 \times$  speedups are observed by SPECO when compared to DME and TACO, respectively.

### 6. Conclusion

For the clock-tree synthesis considering the temperature uncertainty, this paper has presented one new problem formulation: Stochastic PErturbation based Clock Optimization (SPECO). This optimization is to modify the clock-tree topology and to adjust the additional skew induced from the temperature uncertainty. To efficiently find the perturbed clock-tree topology under a set of stochastic temperature maps, the Non-Monte-Carlo method is applied to identify the temperature induced clock-skew and variance by one-time analysis. Moreover, the principal temperature-map analysis is applied to reduce the design complexity by clustering merging points based on the subspace of the correlation. As a result, SPECO can be efficiently deployed to identify the new positions of the perturbed merging points for clock-tree optimization.

The experimental results show that when compared to DME on average, SPECO reduces the skew by  $6.08 \times$  with wire-length overhead 1%, while TACO reduces the skew by  $3.93 \times$  with wire-length overhead 6%. Moreover, only SPECO can reduce the skew variance during the optimization. In addition, SPECO only requires one-time Non-Monte-Carlo analysis but DME and TACO search the worst-case skew from 1000 temperature maps. The new principal temperature-map analysis in SPECO also improves the quality of the synthesized tree and reduces the runtime by  $2.2 \times$  compared to the use of *K*-means method.

### References

- P. Restle, et al., A clock distribution network for microprocessors, IEEE Journal of Solid-state Circuits (2001) 792–799.
- N. Kurd, et al., Multi-ghz clocking scheme for intel(r) pentium(r) 4 microprocessor, in: IEEE International Solid-State Circuits Conference (ISSCC), 2001.
   C.C. Teng, Y.K. Cheng, E. Rosenbaum, S.M. Kang, iTEM: a temperature-
- dependent electromigration reliability diagnosis tool, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (1997) 882–893.
- [4] K. Banerjee, A. Mehrotra, A. Sangiovanni-Vincentelli, C. Hu, On thermal effects in deep sub-micron VLSI interconnects, in: Proceedings of Design Automation Conference, 1999.
- [5] A.H. Ajami, K. Banerjee, M. Pedram, Modeling and analysis of nonuniform substrate temperature effects on global ULSI interconnects, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2005) 849–861.
- [6] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, M.R. Stan, Hotspot: a compact thermal modeling methodology for early-stage VLSI design, IEEE Transactions on VLSI System (2006) 501–513.
- [7] M. Pedram, S. Nazarian, Thermal modeling, analysis and management in VLSI circuits: principles and methods, in: Proceedings of IEEE, 2006, pp. 1487–1501.
- [8] H. Yu, Y. Shi, L. He, T. Karnik, Thermal via allocation for 3-d ics considering temporally and spatially variant thermal power, IEEE Transactions on VLSI System (2006) 1609–1619.
- [9] H. Yu, J. Ho, L. He, Allocating power ground vias in 3d ics for simultaneous power and thermal integrity, ACM Transactions on Design Automation of Electronics Systems, 2009.
- [10] M. Cho, S. Ahmed, D.Z. Pan, TACO: temperature aware clock-tree optimization, in: Proceedings of International Conference on Computer-Aided Design, 2005.
- [11] H. Yu, Y. Hu, C. Liu, L. He, Minimal skew clock embedding considering time variant temperature variation, in: Proceedings of International Symposium on Physical Design, 2007.
- [12] A. Chakraborty, K. Duraisami, A. Sathanur, P. Sithambaram, L. Benini, A. Macii, E. Macii, M. Poncino, Dynamic thermal clock skew compensation using tunable delay buffers, IEEE Transactions on VLSI System (2008) 639–649.
- [13] H. Wang, H. Yu, S.X.-D. Tan, Fast clock skew analysis considering environmental uncertainty by parameterized and incremental macro-models, in: Proceedings of Asia South Pacific Design Automation Conference, 2009.
- [14] R.-S. Tsay, An exact zero skew clock routing algorithm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (1993) 242–249.
- [15] T.-H. Chao, Y.-C.H. Hsu, J.-M. Ho, K.D. Boese, A.B. Kahng, Zero skew clock routing with minimum wirelength, IEEE Transactions on Circuits System 39 (November) (1992) 799–814.
- [16] J. Cong, A.B. Kahng, C.-K. Koh, C.-W.A. Tsao, Bounded-skew clock and steiner routing, ACM Transactions on Design Automation of Electronics Systems, 1997.

- [17] A. Rajaram, J. Hu, R. Mahapatra, Reducing clock skew variability via cross links, in: Proceedings of Design Automation Conference, 2004.
- [18] M. Guthaus, D.S.R. Brown, Clock buffer and wire sizing using sequential programming, in: Proceedings of Design Automation Conference, 2006.
- [19] Y. Chen, C. Dong, D. Chen, Clock tree synthesis under aggressive buffer insertion, in: Proceedings of Design Automation Conference, 2010.
- [20] D. Brooks, M. Martonosi, Dynamic thermal management for high-performance microprocessors, in: IEEE International Symposium on High-Performance Computer Architecture, 2001.
- [21] K. Skadron, M. Stan, et al., Temperature-aware microarchitecture, in: Proceedings of International Symposium on Computer Architecture, 2006.
- [22] N. Kim, T. K. Stan, V. Bertacco, T. Austin, T. Mudge, Microarchitectural power modeling techniques for deep sub-micron microprocessors, in: Proceedings of International Symposium on Low Power Electronics and Design, 2004.
- [23] W. Liao, L. He, K. Lepak, Temperature and supply voltage aware performance and power modeling at microarchitecture level, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2005) 1042–1053.
- [24] H.-S. Jung, M. Pedram, Uncertainty-aware dynamic power management in partially observable domains, IEEE Transactions on VLSI System, 2009.
- [25] D. Xiu, G. Karniadakis, The Wiener-Askey polynomial chaos expansion for stochastic differential equations, SIAM Journal of Scientific Computing (2002) 619–644.
- [26] S. Vrudhula, J.M. Wang, P. Ghanta, Hermite polynomial based interconnect analysis in the presence of process variations, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 1 (2006) 2001–2011.
- [27] T.H. Chao, Hsu Y-C, Ho J-M, K.D. Boese, A. Kahng, Zero skew clock routing with minimum wirelength, IEEE Transactions on Circuits System II (1992) 799–814.
- [28] S. Theodoridis, K. Koutroumbas, Pattern Recognition, 3rd ed., Academic Press, 2006.
- [29] P. Liu, X. Tan, et al., Efficient method for terminal reduction of interconnect circuits considering delay variations, in: Proceedings of the International Conference on Computer-Aided Design, 2005.



Sina Basir-Kazeruni received his B.A.Sc in Electrical Engineering from the University of Waterloo, Canada in 2010. During his undergraduate studies, he interned at various companies including NVIDIA, Synopsys, and Honeywell Aerospace. Sina is currently completing his MS in Electrical Engineering at the University of California, Los Angeles. His research interests include design and optimization of digital integrated circuits and power/area-efficient VLSI systems.



**Hao Yu** obtained his B.S. degree from the Fudan University (Shanghai China) in 1999, and M.S./Ph.D degrees both from Electrical Engineering Department at UCLA in 2007, with major of the integrated circuit and embedded computing. He was a senior research staff at Berkeley Design Automation(BDA). Since October 2009, he is an Assistant Professor at School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. His primary research interests are 3D computing system and designs at nano-tera scale. He has 60 peer-reviewed publications, 5 book/ chapters, best paper award in ACM Transactions on Design Automation of Electronic Systems(TODAES),

best paper award nominations in DAC/ICCAD/ASP-DAC, and inventor award from semiconductor research cooperation(SRC). He is the Associate Editor and technical program committee member of several journals and conferences.



Fang Cong received his B.S. degree from the Computer Science Department at the Beijing University of Aeronautics and Astronautics in 2005. Also, he graduated from Computer Science Department at the Tsinghua University with M.S. degree in 2008. After that, he continues to work toward his Ph.D. degree in the Electrical Engineering Department at the University of California, Los Angeles. His research interests mainly focus on stochastic modeling and simulation for CAD. He also works on healthcare embeded system, parallel and distributed computing.



Yu Hu received his Ph.D. degree from the Department of Electrical Engineering at the University of California, Los Angeles (UCLA), and his M.Eng. and B.Eng. degrees both from Computer Science and Technology Department at Tsinghua University, Beijing, China. He is now a professor at the Department of Electronic Science and Technology of Huazhong University of Science and Technology (HUST). Before joining HUST, he was an Assistant Professor of the Department of Electrical and Computer Engineering at University of Alberta, Edmonton, Canada. His research interests include general aspects of cyber physic systems (CPS), computer architectures, CAD algorithms and applications of field programmable gate

arrays (FPGAs) and embedded systems with applications on Internet of Things. He has published over 40 papers in the major journals and conferences (including IEEE Transactions, DAC and ICCAD) in the field of FPGAs and CAD. Dr. Hu is the recipient of the Outstanding Graduate Student Award from Tsinghua University in 2005, and he is the co-recipient of the Best Contribution Award at IEEE International Workshop of Logic and Synthesis 2008. His work has been nominated for the Best Paper Award three times in International Conference on Computer-Aided Design and Design Automation Conference from 2008 to 2010.



**Chun-Chen Liu** received the B.S. degree in electrical engineering from the National Cheng Kung University, Tainan, Taiwan, R.O.C., in 2003, the M.S. degree in electrical engineering and computer science from the University of California, San Diego co-pai research with U C Berkeley in 2008, and is currently working in Samsung wireless telecommunication research center. He was a Research Assistant in the University of California, Los Angeles from 2006 to 2007. Mr. Liu was the recipient of IBM problem solving award based on the use of the EIP tool suite in 2007. Two of his papers were also nominated as the Best Paper Award candidate at the International Conference on Computer-Aided Design in 2007.



Lei He is a professor at electrical engineering department, University of California, Los Angeles (UCLA) and was a faculty member at the University of Wisconsin, Madison between 1999 and 2002. He also held visiting or consulting positions with Cadence, Empyrean Soft, Hewlett-Package, Intel, and Synopsys, and was technical advisory board member for Apache Design Solutions and Rio Design Automation. Dr. He obtained Ph.D. degree in computer science from UCLA in 1999. His research interests include modeling and simulation, VLSI circuits and systems, and cyber physical systems. He has published one book and over 200 technical papers with 12 best paper nominations

mainly from Design Automation Conference and International Conference on Computer-Aided Design and five best papers or best contribution awards including the ACM Transactions on Electronic System Design Automation 2010 Best Paper Award.