# Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages 


#### Abstract

This paper presents the first in-depth study on applying dual $V_{d d}$ buffers to buffer insertion and multi-sink buffered tree construction for power minimization under delay constraint. To tackle the problem of dramatic complexity increment due to simultaneous delay and power consideration and increased buffer choices, we develop a sampling-based subsolutions (i.e. options) propagation method and a balanced search tree based data structure for option pruning. We obtain 17 x speedup with little loss of optimality compared to the exact option propagation. When dual $V_{d d}$ buffers are considered, our algorithm reduces power by $23 \%$ at the minimum delay specification compared to single $V_{d d}$ buffer insertion. Moreover, compared to the delay-optimal tree using single $V_{d d}$ buffers, our power-optimal buffered tree reduces power by $7 \%$ and $18 \%$ at the minimum delay specification when single $V_{d d}$ and dual $V_{d d}$ buffers are used respectively.


## 1. INTRODUCTION

Aggressive scaling of VLSI circuits makes interconnects the performance bottleneck, and buffer insertion is used extensively to reduce interconnect delay at the expense of more power dissipation. [1] develops a power-optimal buffer insertion algorithm to meet the delay specification. In addition, the buffered tree construction problem for multi-sink nets has also been studied in $[2,3]$, without considering buffer stations $(B S)$ or blockages. [4, 5] present two construction approaches to account for blockage avoidance and $B S$ and quickly explore a few alternative routes for the purpose of delay minimization. [6] presents an optimal delay routing algorithm that also considers $B S$ and blockages while [7] enhanced it with several speed-up techniques. All the buffered tree construction methods do not consider power explicitly.

Recently, $V_{d d}$-promgrammble buffers have been used to reduce FPGA interconnect power [8]. As buffers are preplaced, the dual $V_{d d}$ buffer routing is simplified to dual $V_{d d}$ assignment. However, dual-Vdd buffer insertion and buffered tree construction for ASIC designs is more complicated and have not been studied. We present the first in-depth study on applying dual $V_{d d}$ buffers to buffer insertion ( $D V B$ ) and multi-sink buffered tree generation ( $D$-Tree ) considering both $B S$ and blockages for power minimization under delay constraint. We develop a sampling-based subsolutions (i.e. options) propagation method and a balanced search tree based data structure for option pruning to reduce computational complexity. We obtain a $17 x$ speedup with little loss of optimality over [1] which has exact option propagation. Experimental results show that our $D V B$
algorithm reduces power by $23 \%$ over [1] at the minimum delay specification. Moreover, our $D$-Tree algorithm reduces power by $7 \%$ and $18 \%$ at the minimum delay specification over $[6,7]$ when single $V_{d d}$ and dual $V_{d d}$ buffers are used respectively.

Section 2 discusses the dual $V_{d d}$ buffer modeling, the $D V B$ and the $D$-Tree problem formulations. Section 3 and 4 give the details of the algorithms for solving the $D V B$ and the $D$-Tree problems and their respective experimental results. We conclude the paper in Section 5.

## 2. MODELING AND PROBLEM FORMULATION

### 2.1 Delay, Slew Rate and Power Model

We use distributed Elmore delay model as in $[6,4,7,5]$. The delay due to a piece of wire of length $l$ is given by

$$
\begin{equation*}
d(l)=\left(\frac{1}{2} \cdot c_{w} \cdot l+c_{l o a d}\right) \cdot r_{w} \cdot l \tag{1}
\end{equation*}
$$

where $c_{w}$ and $r_{w}$ are the unit length capacitance and resistance of the interconnect and $c_{\text {load }}$ is the capacitive loading at the end of the wire. We also use Elmore delay times $\ln 9$ as the slew rate metric [9]. The delay of a buffer is given by

$$
\begin{equation*}
d_{b u f}=d_{i n t}+r_{o} \cdot c_{\text {load }} \tag{2}
\end{equation*}
$$

where $d_{i n t}, r_{o}$ and $c_{l o a d}$ are the intrinsic delay, output resistance and capacitive loading at the output of the buffer respectively. We obtain $r_{o}$ and $d_{i n t}$ for both high $V_{d d}$ and low $V_{d d}$ buffers, and we observe that both values are higher for low $V_{d d}$ buffers.

In the context of buffer insertion with upper slew rate constraint, we observe that slew rates at the buffer inputs and the sinks are always within up to a few tens $p s$ of the upper bound. Therefore we model buffer delay with negligible error by assuming input slew rate close to the upper bound To illustrate, we consider a rather loose slew rate bound $\hat{s}=100 \mathrm{ps}$, which is $20 \%$ of the clock period for a 2 Ghz clock. Using the formula for delay optimal buffer insertion length $l_{\text {opt }}$ [10] for a single interconnect wire and the information of the 65 nm technology node [11] that is summarized in Table 1, we get $l_{o p t}=3924 \mu \mathrm{~m}$. On the other hand, from Equation (1) and assuming $c_{\text {load }}=0$ and infinite size driver at the input of the wire, we have $l_{\text {slew }}=3073 \mu \mathrm{~m}$, which is the largest length of an unbuffered interconnect without slew rate violation and is much smaller than $l_{\text {opt }}$. Considering power optimality increases the optimal insertion length
$l_{\text {opt }}$ [10], while considering non-zero wire loading $c_{\text {load }}$, additional delay due to finite driver strength and a tighter slew rate bound further decreases $l_{\text {slew }}$. Therefore we tend to insert buffers in order to the satisfy the slew rate bound, which results in close-to-bound slew rates.

| Settings | Values |
| :---: | :---: |
| Simulators | QuickCap [12] (interconnect) <br> BSIM 4 + SPICE [13] (device) |
| Interconnect | $r_{w}=0.186 \Omega / \mu m, c_{w}=0.0519 f F / \mu m$ <br> $(65 \mathrm{~nm}$, global, min space and width) |
| Buffer <br> (min size) | $c_{d d}=0.47 \mathrm{fF}$, <br> $V_{d d}=1.2 \mathrm{~V}: r_{o}=4732 \Omega, d_{\text {int }}=72.0 \mathrm{Vs}: r_{o}=5364 \Omega, d_{\text {int }}=97.7 p s$ |

Table 1: Settings for the 65 nm global interconnect.
Note that more accurate slew rate and delay models that support bottom up (i.e. sink-to-source) calculation such as [14] can be used instead. The use of simpler delay and slew metrics here is for the ease of implementation and explanation only.

We measure interconnect power by energy per switch. The energy per switch for an interconnect wire of length $l$ is

$$
\begin{equation*}
E_{w}=0.5 \cdot c_{w} \cdot l \cdot V_{d d}^{2} \tag{3}
\end{equation*}
$$

We collapse per switch short-circuit and dynamic power consumed by a buffer into a single value $E_{b u f}$, which is a function of both $V_{d d}$ and buffer size. We observe that low $V_{d d}$ buffers have a much smaller $E_{b u f}$ than the same-sized high $V_{d d}$ counterparts. On the other hand, since the leakage component of the total energy consumption depends very much on the operating frequency and the switching activity, we choose not to include leakage power in our study.

### 2.2 Dual $V_{d d}$ Technique

Dual $V_{d d}$ buffering uses both high $V_{d d}$ and low $V_{d d}$ buffers in interconnect synthesis. Designs using low $V_{d d}$ buffers consume less buffer $E_{b u f}$ and interconnect (Equation 3) power. Applying this technique to non-critical paths, we achieve power saving without worsening the delay of the overall interconnect tree.

We only allow high $V_{d d}$ buffers followed by low $V_{d d}$ buffers but not the reverse except at right before the sinks. A high $V_{d d}$ buffer can drive a low $V_{d d}$ buffer, but a low $V_{d d}$ buffer driving a high $V_{d d}$ one may cause a large leakage power. Therefore, a $V_{d d}$-level converter must be inserted between the low $V_{d d}$ buffer and its high $V_{d d}$ fanout buffers. We assume that the driver at the source operates at high $V_{d d}$ and $V_{d d}$-level converters are placed only at the sinks driven by low $V_{d d}$ buffers. We do not consider $V_{d d}$-level converters explicitly in our algorithm, which reduces power and delay overhead that is introduced by using a large number of $V_{d d^{-}}$ level converters. Considering a simple case in Figure 1, the configuration in (a) must have a larger power than that in (b) due to the the level converter and the fact that the low $V_{d d}$ buffer instead of the high $V_{d d}$ buffer is driving the load $C_{l}$. To have the delay of case (b) larger than that of (a), we make

$$
\begin{equation*}
\left(R_{b}^{L}-R_{b}^{H}\right) \cdot C_{l}+R_{b}^{H} \cdot C_{b}^{L}-R_{b}^{L} \cdot C_{L C}-R_{L C} \cdot C_{b}^{H}-d_{L C} \geq 0 \tag{4}
\end{equation*}
$$

where $d_{L C}$ is the intrinsic delay of the level converter and all other parameters are shown in Figure 1. By trying all combinations of buffer sizes ( $16 \mathrm{x}, 32 \mathrm{x}, 64 \mathrm{x}$ in our study)
and by using the parasitic values for a properly sized level converter, we have $C_{l}$ to be at least $0.5 p F$, or equivalently a 9 mm long global interconnect worth of capacitance, for Equation (4) to become true, which is extremely unlikely in any buffered interconnect design. Therefore (b), which has no level converter, is very likely to be a superior design than (a). This provides us with the insight about excluding level converters in our study, which saves runtime by considering a smaller and more productive solution space.


Figure 1: Demonstrating level converter overhead.

### 2.3 Dual $V_{\mathrm{dd}}$ Buffer Insertion Problem

We assume that the loading capacitance and the required arrival times $(R A T) q_{n}^{s}$ are given at all sink terminals $n_{s}$. We assume that the driver resistance at the source node $n_{s r c}$ is given. We also assume that all types of buffers can be placed only at the buffer candidate nodes $n_{b}^{k}$. We use the $R A T$ at the source $n_{\text {src }}$ to measure delay performance. Our goal is to minimize power of the interconnect subject to the $R A T$ constraint at the source $n_{s r c}$.

Definition 1. The required arrival time $(R A T) q_{n}$ at node $n$ is defined as

$$
q_{n}=\min _{n_{s} \forall s}\left(q_{n}^{s}-d\left(n_{s}, n\right)\right)
$$

where $d\left(n_{s}, n\right)$ is the delay from the sink node $n_{s}$ to $n$.
Dual $V_{d d}$ Buffer Insertion ( $\boldsymbol{D V B}$ ) Problem - Given an interconnect fanout tree which consists of a source node $n_{s r c}$, sink nodes $n_{s}$, Steiner nodes $n_{p}$, candidate buffer nodes $n_{b}$ and the connection topology among them, the $D V B$ Problem is to find a buffer placement, a buffer size assignment and a $V_{d d}$ level assignment solutions such that the $R A T q_{n}^{s r c}$ at the source $n_{s r c}$ is met and the power consumed by the interconnect tree is minimized, while slew rate at every input of the buffers and the sinks $n_{s}$ are upper bounded by $\hat{s}$.

### 2.4 Dual $V_{\mathrm{dd}}$ Buffered Tree Construction

We measure the delay and power performance using the same metric as in the $D V B$ formulation. Assuming that a floorplan of the layout is available, we can identify the locations and shapes of rectangular blockages, which allows wiring on top but forbid buffer insertion, and the locations of the buffer station $(B S)$ which are the allocated space for buffer insertion. Therefore we have the following problem formulation.

Dual $V_{d d}$ Buffered Tree Construction ( $D$-Tree ) Problem - Given the locations of a source node $n_{s r c}$, sink nodes $n_{s}$, blockages and $B S$, the $D$-Tree problem is to find the minimum power embedded rectilinear spanning tree with
the buffer placement, the buffer sizes and the $V_{d d}$ assignment on the floorplan that satisfy the $R A T q_{n}^{s r c}$ constraint at the source $n_{s r c}$ and the slew rate bound $\hat{s}$ at every input of the buffers and the sinks $n_{s}$.

In the $D$-Tree problem, we have alternative tree topologies as an extra dimension over the $D V B$ problem for optimization. Two $D$-Tree solutions are shown in Figure 2. The large rectangle and the black dots are the blockage and the $B S$ respectively. Both cases achieve the same $R A T$ at the source $n_{\text {src }}$. However, (a) has to go across a wide blockage and therefore has to rely on running a long high $V_{d d}$ net. An alternative route is shown in Figure 2(b) in which it chooses to go around the blockage so that it can insert more buffers to achieve the same delay while keeping the long route at low $V_{d d}$, which turns out to save power compared to (a).


Figure 2: Routing as a design freedom for power.

## 3. BUFFER INSERTION

Power-optimal solutions are constructed from partial solutions from the subtrees. We call them as options, which are defined below.

Definition 2. An option $\Phi_{n}$ at the node $n$ refers to the buffer placement, size and $V_{d d}$ assignment for the subtree $T_{n}$ rooted at $n$. To perform delay and power optimization, the option is represented as a 4 -tuple $\left(c_{n}, p_{n}, q_{n}, \theta_{n}\right)$, where $c_{n}$ is the downstream capacitance at $n, p_{n}$ is the total power of $T_{n}, q_{n}$ is the RAT at $n$ and $\theta_{n}$ signifies whether there exists any high $V_{d d}$ buffer at the downstream. The option with the smallest power $p_{n}^{s r c}$ at the source node $n_{s r c}$ is the power-optimal solution.

Our algorithm is based on [1] with a few improvements. We add support for dual $V_{d d}$ buffer insertion without level converters. We also improve the runtime by introducing uniform sampling of the options under each capacitance value to reduce the number of options generated with negligible loss of optimality. To facilitate explanation, we define the concept of option dominance here.

Definition 3. An option $\Phi_{1}=\left(c_{1}, p_{1}, q_{1}, \theta_{1}\right)$ dominates another option $\Phi_{2}=\left(c_{2}, p_{2}, q_{2}, \theta_{2}\right)$ if $c_{1} \leq c_{2}, p_{1} \leq p_{2}$ and $q_{1} \geq q_{2}$.

### 3.1 Baseline Algorithm

We enhance the dynamic programming framework in [1] to accomodate the introduction of dual $V_{d d}$ buffers, which is summarized in Table 2. We use the same notation as in Definition 2 to denote options $\Phi$ and their components. Moreover, we use $c_{b}^{k}, E_{b}^{k}, V_{b}^{k}$ and $d_{b}^{k}\left(c_{\text {load }}\right)$ to denote the
input capacitance, the power, the $V_{d d}$ level and the delay with output load $c_{\text {load }}$ of the buffer $b_{k} . d_{n, v}$ and $E_{n, v}(V)$ are the delay and the power of the interconnect between nodes $n$ and $v$ operating at voltage $V$. The set of available buffers $\operatorname{Set}(B)$ contains both low $V_{d d}$ and high $V_{d d}$ buffers. We first call $D P$ at the source node $n_{s r c}$, which recursively visits the children nodes and enumerates all possible options in a bottom up manner until the entire interconnect tree $T_{n}^{s r c}$ is traversed.

```
Algorithm: \(\quad D P\left(T_{n}, \operatorname{Set}(B)\right)\)
    \(\operatorname{Set}\left(\Phi_{n}\right)=\left(c_{n}^{s}, 0, q_{n}^{s}\right.\), false \()\) if \(n\) is a sink
            else \((0,0, \infty\), false \()\)
    for each child \(v\) of \(n\)
        \(\operatorname{Set}\left(\Phi_{v}\right)=\) sampled \(D P\left(T_{v}\right)\)
        \(\operatorname{Set}\left(\Phi_{t e m p}\right)=\operatorname{Set}\left(\Phi_{n}\right)\)
        \(\operatorname{Set}\left(\Phi_{n}\right)=\emptyset\)
        for each \(\Phi_{i} \in \operatorname{Set}\left(\Phi_{v}\right)\)
            for each \(\Phi_{t} \in \operatorname{Set}\left(\Phi_{\text {temp }}\right)\)
            for each buffer \(b_{k} \in \operatorname{Set}(B)\)
            /* also contains the no buffer option \(\phi^{*}\) /
            if \(b_{k}=\phi\)
                \(V_{n}=V_{H}\) if \(\theta_{i}\) or \(\theta_{t}\) is true, else \(V_{L}\)
                \(\Phi_{\text {new }}=\left(c_{i}+c_{t}, p_{i}+p_{t}+E_{n, v}\left(V_{n}\right)\right.\),
                \(\min \left(q_{t}, q_{i}-d_{n, v}\right), \theta_{i}\) or \(\left.\theta_{t}\right)\)
            else if i. \(V_{b}^{k}\) is high; or
                    ii. \(V_{b}^{k}\) is low and \(\theta_{i}\) is false
                \(\Phi_{\text {new }}=\left(c_{b}, p_{i}+p_{t}+E_{n, v}\left(V_{b}^{k}\right)+E_{b}^{k}\right.\),
                                    \(\min \left(q_{t}, q_{i}-d_{n, v}-d_{b}^{k}\left(c_{i}+c_{n, v}\right)\right.\),
                                    \(\theta_{t}\) or \(\left.\left(i f V_{b}^{k}=V_{H}\right)\right)\)
                else goto line 7
                if i. slew rate violation at downstream buffers; or
                ii. \(\Phi_{\text {new }}\) dominated by any \(\Phi_{z} \in \operatorname{Set}\left(\Phi_{n}\right)\)
                drop \(\Phi_{\text {new }}\)
            else
                remove all \(\Phi_{z} \in \operatorname{Set}\left(\Phi_{n}\right)\) dominated by \(\Phi_{\text {new }}\)
                \(\operatorname{Set}\left(\Phi_{n}\right)=\operatorname{Set}\left(\Phi_{n}\right) \cup\left\{\Phi_{\text {new }}\right\}\)
    return \(\operatorname{Set}\left(\Phi_{n}\right)\)
```

Table 2: Dynamic programming for buffer insertion.
There are several new features in our algorithm in order to support the insertion of dual $V_{d d}$ buffers. Line 10 and 12 of Table 2 produce the new options $\Phi_{\text {new }}$ for the cases of no buffer insertion and inserting buffer $b_{k}$ respectively between node $n$ and $v$. In the case of no buffer insertion, we set $V$ to either $V_{H}$ for high $V_{d d}$ or $V_{L}$ for low $V_{d d}$ at line 9 according to the downstream high $V_{d d}$ buffer indicators $\theta_{i}, \theta_{j}$, and line 10 makes use of $V$ to update the power consumed by the interconnect. Note that when $\theta=$ false (ie. there is no high $V_{d d}$ buffers in the downstream), only the low $V_{d d}$ option has to be created since the high $V_{d d}$ counterpart is always inferior. In the case of buffer insertion, we simply add $E_{n, v}\left(V_{b}^{k}\right)$ according to the operational voltage of buffer $b_{k}$ to $p_{\text {new }}$ and update $\theta$ accordingly. Also note that we use line 11 to guard against low $V_{d d}$ buffers driving high $V_{d d}$ buffers to avoid the need of level converters, as explained in Section 2.1.

### 3.2 Power-Delay Sampling

We apply the technique of sampling to the options to reduce the growth of options, which can go to the order of billions for large nets if uncontrolled. The idea is to pick only a certain number of options among all options for upstream propagation (line 2 of Table 2) in the algorithm $D P$. Figure 3 shows (a) the pre-sample and (b) the after-sample option sets under the same capacitance. Each black dot corresponds to an option. We divide each side of the bounding box of all options into equal segments such that the entire
power-delay domain are superposed by a grid. For each grid square in Figure 3(a), we retain only one option if there is any. By also including the smallest power option and the largest RAT option, we obtain the sampled non-dominated option set in Figure 3(b).

(a)

(b)

Figure 3: Sampling the non-dominated options.

Note that we do not sample on to capacitance values. The capacitance value in an option is for the purpose of accurate calculation of power and delay in the upstream of the tree. Moreover, the number of capacitance values is relatively small due to the upper bound slew rate constraint, which means that sampling on capacitance value has little effect anyway.

### 3.3 Experiment

We test our algorithm on 10 testcases $s 1 \sim s 10$ generated by randomly placing source and sink pins in a $1 \mathrm{~cm} \times$ 1 cm box. We use a rectilinear Steiner tree generation package [15] to generate the connection between the source and the sink pins. We also break interconnect between nodes longer than $500 \mu \mathrm{~m}$ by inserting degree- 2 nodes. In this experiment we assume that every non-terminal nodes are candidate buffer nodes. We set the RAT at all sinks to 0 so that the objective becomes minimizing the maximum delay from the source to any sink. Table 1 lists all the technology related settings. The slew rate bound $\hat{s}$ is set to 100ps. We have made buffers using an inverter cascaded with another inverter which is four times larger. Buffer sizes used in the experiment are $16 \mathrm{x}, 32 \mathrm{x}$ and 64 x . We compare three algorithms, which are i. power-optimal buffer insertion $(P B)$ algorithm [1] considering only single (high) $V_{d d}$ buffers; ii. $S V B$ for our $D V B$ algorithm considering only high $V_{d d}$ buffers; and iii. $D V B$ for our $D V B$ algorithm considering dual $V_{d d}$ buffers. In both $S V B$ and $D V B$ we set the sampling grid to $20 \times 20$.


Figure 4: Non-dominated solutions of $s 4$.
Figure 4 shows all non-dominated options at the source node $n_{\text {src }}$ (i.e. valid solutions) of the testcase s4. We observe
that the sampling approximation introduced by our $D V B$ algorithm has almost no impact on the power-delay optimality, as the options from $S V B$ follow those from $P B$ very closely. We also see that introduction of dual $V_{d d}$ buffers in $D V B$ significantly improves the power optimality by pushing all option to the left of the graph.

Table 3 shows the experimental results for the three algorithms that we consider. Since the power and delay values are extremely similar between $P B$ and $S V B$, we omit those for $P B$ to save space. $R A T^{*}$ is the maximum achievable $R A T$ at the source. The percentages in the brackets show the relative change of power from $S V B$ to those in $D V B$. Runtime is measured on a Intel Xeon 1.9 Ghz Linux workstation with 2 Gb of memory. We see that on average using dual $V_{d d}$ buffers reduces power by $23 \%$ compared to the case when only high $V_{d d}$ buffers are considered at $R A T^{*}$. When we relax the $R A T$ at the source to $105 \%$ of $R A T^{*}$, the dual $V_{d d}$ buffer solution saves $26 \%$ of power compared to the high $V_{d d}$ buffer-only solutions. Also notice that $S V B$ is 17 x faster than $P B$ on average.

## 4. BUFFERED TREE CONSTRUCTION

Using the sampling technique in Section 3.2, we attempt to extend the algorithms in $[6,7]$ to handle dual $V_{d d}$ buffered tree construction with power minimization as the objective. The $D$-Tree problem is an NP-Hard problem. In fact, in the case of no $B S$ and blockages, the $D$-Tree problem is essentially the optimal rectilinear Steiner tree problem and is known to be NP-Complete [16]. The artifact of the NPhardness is the exponential growth of the number of options, which is complicated by considering power in addition to delay. We find that if we sample options using a very sparse grid (eg. $2 \times 2$ grid), we end up losing power optimality by dropping too many options. However, a denser grid causes catastrophic increase in runtime if we perform a linear scan for pruning each time the algorithm creates a new option. Therefore, solving the $D$-Tree problem requires a very efficient way of managing options, which has not been considered in $[6,7]$.

The data structure in [1] which uses an augmented orthogonal search tree for option pruning is a good starting point. The authors use a hash table labeled by power values as a container for search trees of capacitance and delay. In their algorithm they always add the options into the tree in the order of increasing capacitance. When combined with their dominance detection scheme, the algorithm adds only non-dominated options into the tree.

However, we cannot directly apply the data structure and operations described in [1] to solving the $D$-Tree problem. In this problem the order of node traversal is not known as a priori due to the combinatorial nature of the path searching problem. Therefore we can no longer guarantee the order by which options are added to the search tree. This may cause dominated options residing in the search tree, which causes the tree to take $O\left((\log m)^{2}\right)$ time (where $m$ is the number of options in the tree) per addition of option to update if balanced trees are used. Moreover, keeping redundant options also worsens the space requirement. Therefore, we need a way to efficiently prune options from the tree in order to retain option non-redundancy.

### 4.1 Dynamic Pruning

We propose an improved data structure, as shown in Fig-

| Testcase |  |  | runtime (s) |  |  | SVB |  | DVB |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| net | $\begin{gathered} \# \\ \text { nodes } \end{gathered}$ | $\begin{gathered} \# \\ \operatorname{sinks} \end{gathered}$ | $\begin{gathered} \text { PB } \\ (\mathrm{s}) \\ \hline \end{gathered}$ | $\begin{aligned} & \text { SVB } \\ & (\mathrm{s})[\mathrm{x}] \end{aligned}$ | DVB | $\begin{gathered} \hline \text { power @ } \\ \text { RAT* } \\ \text { (fJ) } \\ \hline \end{gathered}$ | $\begin{gathered} \text { power }{ }^{@} \\ 105 \% \\ R A T^{*}(\mathrm{fJ}) \\ \hline \end{gathered}$ | $\begin{gathered} \text { power @ } \\ R A T^{*}[\mathrm{x}] \\ (\mathrm{fJ})[\%] \\ \hline \%] \end{gathered}$ | $\begin{gathered} \text { power @ } \\ 105 \% R A T^{*} \\ (\mathrm{fJ})[\%] \\ \hline \end{gathered}$ |
| s1 | 86 | 19 | 3 | $2[1.5]$ | 6 | 4669 | 4127 | 3980 [-15\%] | 3277 [-21\%] |
| s2 | 102 | 29 | 4 | 3 [1.3] | 9 | 5476 | 4844 | 4785 [-13\%] | 3750 [-23\%] |
| s3 | 142 | 49 | 17 | 7 [2.5] | 20 | 8123 | 6316 | 6930 [-15\%] | 4804 -24\%] |
| s4 | 226 | 99 | 224 | 33 [6.8] | 64 | 13232 | 9440 | 11322 [-14\%] | $7876[-17 \%]$ |
| s5 | 375 | 199 | 719 | 86 [8.4] | 212 | 18699 | 15275 | 13808 [-26\%] | 11376 [-26\%] |
| S6 | 515 | 299 | 2121 | 139 [15] | 371 | 23443 | 20117 | 17239 -26\%] | 14703 -27\%] |
| S7 | 784 | 499 | 33419 | 393 [85] | 635 | 33552 | 28336 | 23804 -29\%] | 20221 -29\%] |
| s8 | 1054 | 699 | - | 598 | 1072 | 38351 | 33686 | 25799 [-33\%] | 22985 [-32\%] |
| s9 | 1188 | 799 | - | 853 | 1859 | 40228 | 36358 | 26646 [-34\%] | 23045 [-37\%] |
|  |  |  |  | [17] |  |  |  | [-23\%] | [-26\%] |

Table 3: Experimental result of single and dual $V_{d d}$ buffer insertion.
ure 5, similar to the one in [1] but also support solution pruning from the search trees. We label the hash table using capacitance instead of power and keep the power and $R A T$ portion of options in the tree instead. The slew rate upper bound tends to tightly upper bound maximum value of capacitance and therefore the hash table tends to be smaller, which results in less search trees.


Figure 5: Data structure for option pruning.
The search trees are ordered so that at each node the power value is larger (smaller) than those in the nodes of the left (right) subtree respectively. We always maintain the tree so that no option dominates any other. Following from this, we immediately see that all $R A T q$ are in the same order as power $p$, i.e. the $q$ values in the left (right) subtree of the node $n$ are smaller than (larger) than the RAT $q$ of $n$. Therefore, we do not require explicit maintanance of the largest $R A T$ in the left subtree as in [1].

Our algorithm to prune dominated options from the tree is summarized in Table 4. $\operatorname{Set}\left(\Phi_{n}\right)$, which contains the options at node $n$, are organized in the data structure mentioned above. In the pseudo-code we treat any option $\Phi_{\text {cur }}$ as a node in the search tree, and therefore $\Phi_{\text {cur }} \rightarrow$ left refers to the left child of the node storing the option $\Phi_{\text {cur }}$. We use $T_{\Phi}$ to denote the subtree rooted at $\Phi$. For each capacitance value that is larger than that in the new option $\Phi_{\text {new }}$, line $2 \sim 7$ look for the first option $\Phi_{\text {cur }}$ in the tree that $\Phi_{\text {new }}$ domiantes. If one is found, line $8 \sim 13$ prune the left subtree of $\Phi_{\text {new }}$ with a single downward pass of the tree, which takes only $O(\log m)$ time for $m$ options in the tree, by making use of the special tree ordering. The right subtree of $\Phi_{\text {cur }}$ is also pruned in a similar fashion. Note that after this step, options in the $\operatorname{Set}\left(\Phi_{\text {junk }}\right)$ can be removed and $\Phi_{\text {new }}$ can be inserted as usual in a balanced tree in $O(\log m)$ time. Rotation, which helps balancing the tree, requires no label updating as long as no option in the tree is dominated.

### 4.2 The $D$-Tree Algorithm

Table 5 summarizes the $D$-Tree algorithm. Each option

```
Algorithm CleanDominate \(\left(\Phi_{n e w}, \operatorname{Set}\left(\Phi_{n}\right)\right)\)
    \(\operatorname{Set}\left(\Phi_{j u n k}\right)=\emptyset\)
    for each distinct capacitance \(c>c_{n e w}\) in \(\operatorname{Set}\left(\Phi_{n}\right)\)
        \(\Phi_{\text {cur }}=\) option at the root of the search tree under \(c\)
        while \(\Phi_{\text {cur }} \neq \phi\)
            case 1: \(p_{\text {new }}<p_{\text {cur }}, q_{\text {new }}<p_{\text {cur }}\),
            \(\Phi_{c u r}=\Phi_{\text {cur }} \rightarrow l e f t\)
            case 2: \(\quad p_{\text {new }}<p_{\text {cur }}, q_{\text {new }}>q_{\text {cur }}\), goto line 2
            case 3: \(p_{\text {new }}>p_{\text {cur }}, q_{\text {new }}<q_{\text {cur }}\), goto line 9
            case 4: \(p_{\text {new }}>p_{\text {cur }}, q_{\text {new }}>q_{\text {cur }}\),
            \(\Phi_{\text {cur }}=\Phi_{\text {cur }} \rightarrow\) right
        \(\operatorname{Set}\left(\Phi_{\text {junk }}\right)=\operatorname{Set}\left(\Phi_{\text {junk }}\right) \cup\left\{\Phi_{\text {cur }}\right\}\)
        \(\Phi_{\text {dom }}=\Phi_{\text {cur }} \rightarrow\) left
        while \(\Phi_{\text {dom }} \neq \phi\)
            case 1: \(p_{\text {new }}<p_{\text {dom }}\),
            \(\operatorname{Set}\left(\Phi_{j u n k}\right)=\operatorname{Set}\left(\Phi_{j u n k}\right) \cup\left\{\Phi_{\text {dom }}, T_{\Phi_{\text {dom }} \rightarrow r i g h t}\right\}\)
            \(\Phi_{\text {dom }}=\Phi_{\text {dom }} \rightarrow\) left
            case 2: \(\quad p_{\text {new }}>p_{\text {dom }}\),
            \(\Phi_{\text {dom }}=\Phi_{\text {dom }} \rightarrow\) right
        repeat line \(9 \sim 12\) with modifications:
        i. exchange 'left' and 'right';
        ii. replace \(p_{n e w}\) and \(p_{\text {dom }}\) with \(q_{n e w}\) and \(q_{d o m}\); and
        iii. exchange '<' and '>
```

Table 4: Dynamic tree update.
now stores the "sink set" $\mathcal{S}$ and "reachability set" $\mathcal{R}$ to keep track of the sinks and the other nodes that the current option covers. The algorithm starts by building a grid using the "escape node algorithm" in [7]. Line 1~4 create the candidate buffer insertion nodes $n_{b}^{k}$ by looking for intersection points between $B S$ and the grid lines $\left(n_{i}, n_{j}\right)$. The core process of creating new options $\Phi_{\text {new }}$ considering dual $V_{d d}$ buffers is the same as that in the $D V B$ algorithm (refer to line 8-18 of Table 2) with additional book keeping to track the routability. The new pruning data structure in Section 4.1 is applied at line 17 for pruning options from $\operatorname{Set}\left(\Phi_{j}\right)$.

### 4.3 Experiment

We create 5 testcases g1~g5 by randomly generating source and sink pins in a $1 \mathrm{~cm} \times 1 \mathrm{~cm}$ box. We also randomly generate blockages so that it consumes approximately $30 \%$ of the total area of the box. Horizontal and vertical $B S$ are randomly scattered in the box so that the average distance between two consecutive $B S$ is about $1000 \mu \mathrm{~m}$. The scales of these testcases as a result are similar to those in [6]. We use 32 x and 64 x buffers. We set the RAT of all sinks to 0 so that maximizing $R A T$ at the source corresponds to minimizing the maximum delay from the source to any sink. The slew rate bound $\hat{s}$ is set to 100 ps . We again refer to Table 1 for technology related settings. We compare three cases,

```
Algorithm \(D T R E E\left(n_{s r c}, \operatorname{Set}\left(n_{s}\right)\right.\), Set \(\left.(B S), \operatorname{Set}(B l o c k a g e)\right)\)
    \(\left\{\operatorname{Set}\left(n_{p}\right), \aleph(\operatorname{Set}(n))\right\}=\operatorname{Grid}(\operatorname{Set}(n), \operatorname{Set}(\) Blockage \())\)
    for each node \(n_{i} \in \operatorname{Set}(n)\)
    for each neighbour node \(n_{j} \in \aleph\left(n_{i}\right)\)
        \(\operatorname{Set}(n)=\operatorname{Set}(n) \cup\left\{n_{p}\right.\) created by edge \(\left.\left(n_{i}, n_{j}\right) \cap \operatorname{Set}(B S)\right\}\)
        \(\aleph\left(n_{p}\right)=\left\{n_{i}, n_{j}\right\} ;\) update \(\aleph\left(n_{i}\right), \aleph\left(n_{j}\right)\)
    \(Q\left(\Phi_{n}^{c u r}\right)=\bigcup_{n_{s} \in \operatorname{Set}\left(n_{s}\right)} \operatorname{Set}\left(\Phi_{n}^{s}\right)\)
    while \(Q\left(\Phi_{n}^{c u r}\right) \neq \emptyset\)
        \(\Phi_{n}^{c u r}=\operatorname{pop} Q\left(\Phi_{n}^{c u r}\right)\)
        for each neighbour \(n_{j} \in \aleph\left(n_{\text {cur }}\right)\)
            for each option \(\Phi_{n}^{j} \in \operatorname{sampled} \operatorname{Set}\left(\Phi_{n}^{j}\right)\)
                if \(\left(\Phi_{n}^{j} . \mathcal{R}\right) \cap\left(\Phi_{n}^{{ }_{n}^{n u r}} . \mathcal{R}\right)=\emptyset\)
                    (form \(\Phi_{\text {new }}\) similar to line \(7 \sim 14\) in Table 2)
                    \(\Phi_{n e w} \cdot \mathcal{R}=\left(\Phi_{n}^{j} . \mathcal{R}\right) \cup\left(\Phi_{n e w} \cdot \mathcal{R}\right)\)
                    \(\Phi_{\text {new }} \cdot \mathcal{S}=\left(\Phi_{n}^{j} \cdot \mathcal{S}\right) \cup\left(\Phi_{\text {new }} \cdot \mathcal{S}\right)\)
                    if i. slew rate violation at downstream buffers; or
                        ii. \(\Phi_{\text {new }}\) dominated by any
                        \(\left\{\Phi_{n}^{j}:\left(\Phi_{n e w} \cdot \mathcal{S}\right) \subseteq\left(\Phi_{n}^{j} \cdot \mathcal{S}\right), \Phi_{n}^{j} \in \operatorname{Set}\left(\Phi_{n}^{j}\right)\right\}\)
                drop \(\Phi_{n e w}\)
                else
                        remove \(\left\{\Phi_{n}^{j}:\left(\Phi_{\text {new }} \cdot \mathcal{S}\right) \supseteq\left(\Phi_{n}^{j} \cdot \mathcal{S}\right), \Phi_{n}^{j} \in \operatorname{Set}\left(\Phi_{n}^{j}\right)\right\}\)
                dominated by \(\Phi_{n e w}\)
                    \(\operatorname{Set}\left(\Phi_{n}^{j}\right)=\operatorname{Set}\left(\Phi_{n}^{j}\right) \cup\left\{\Phi_{n e w}\right\}\)
                push \(\Phi_{\text {new }}\) into \(Q\left(\Phi_{\text {cur }}\right)\) if \(n_{j} \neq n_{\text {src }}\)
```

Table 5: Dual $V_{d d}$ buffered tree generation.
which are i. $R M P$ in [6] for timing-aware buffered tree generation; ii. $S$-TREE for our $D$-Tree algorithm considering single (high) $V_{d d}$ buffers; and iii. $D-T R E E$ for $D$-Tree algorithm considering dual $V_{d d}$ buffers. Note that in the original implementation of [6] only options with the smallest capacitance under each reachable set are kept, which the authors claim has minimal impact on $R A T$ optimality through experiment. However, we have found that the validity of this claim has strong correlation with the positions and density of the buffer candidate nodes. Therefore we choose to exclude this speed up heuristic to avoid losing optimal $R A T$.

| Testcase |  | $R M P$ | $S-T R E E$ | $D-T R E E$ |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $\#$ <br> node <br> sink | power <br> $@ R A T^{*}$ <br> $(\mathrm{pJ})$ | power <br> $@ R A T^{*}$ <br> $(\mathrm{pJ})[\%]$ | power <br> @ $R A T^{*}$ <br> $(\mathrm{pJ})[\%]$ | run- <br> time <br> $(\mathrm{s})$ |  |
| 97 | 2 | 1.6 | $1.6[0 \%]$ | $1.5[-7 \%]$ | 1 |
| 165 | 3 | 3.4 | $3.4[0 \%]$ | $3.2[-4 \%]$ | 35 |
| 137 | 4 | 3.9 | $3.5[-10 \%]$ | $2.9[-23 \%]$ | 66 |
| 261 | 5 | 4.9 | $4.4[-13 \%]$ | $3.1[-37 \%]$ | 937 |
| 235 | 6 | 4.2 | $3.8[-10 \%]$ | $3.4[-18 \%]$ | 1391 |
|  |  |  | $-7 \%]$ | $-18 \%]$ |  |

Table 6: Experimental result of timing-aware and dual $V_{d d}$ low power buffered tree generation.

Table 6 shows the experimental results for the five test cases. We compare the power consumption at the maximum achievable $R A T$ of each net. The percentages in the brackets show the reductions of power from the $R M P$ to the $D$-Tree formulation with high and dual $V_{d d}$ buffers respectively. We observe a $7 \%$ reduction through power-minimization using high $V_{d d}$ buffers. Using dual $V_{d d}$ buffers gives $18 \%$ of power reduction over $R M P$. Note that power-optimal solution considering high $V_{d d}$ alone may not yield a better power as shown in the first two testcases, but the extra optimization dimension provided by using dual $V_{d d}$ always helps achieve power savings.

## 5. CONCLUSION AND FUTURE WORK

This paper presents the first in-depth study on applying dual $V_{d d}$ buffers to buffer insertion and multi-sink buffered tree construction for power minimization under delay constraint. We develop a sampling-based sub-solutions (i.e. options) propagation method and a balanced search tree based data structure for option pruning to cope with the increased complexity due to simultaneous delay and power consideration and increased buffer choices. We obtain 17 x speedup with little loss of optimality compared to the exact option propagation [1]. Extensive experimental results show that when dual $V_{d d}$ buffers are considered, our algorithm reduces power by $23 \%$ at the minimum delay specification compared to [1]. Moreover, compared to the delay-optimal tree using single $V_{d d}$ buffers [6, 7], our power-optimal buffered tree reduces power by $7 \%$ and $18 \%$ when single $V_{d d}$ and dual $V_{d d}$ buffers are used respectively.

This work does not consider the issues regarding dual $V_{d d}$ power distribution network - we assume that all buffer stations support dual $V_{d d}$ buffers, which is valid only if the design uses dual $V_{d d}$ circuits extensively throughout the chip. In a lot of multiple $V_{d d}$ ASIC designs, designers assign voltages to regions ( $V_{d d}$ bays). Co-optimization of dual $V_{d d}$ buffered routing and placement of these $V_{d d}$ bays is an interesting problem to pursue.

## 6. REFERENCES

[1] J. Lillis, C. Cheng, and T. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," in ICCAD, Nov. 1995.
[2] T. Okamoto and J. Cong, "Buffered Steiner tree construction with wire sizing for interconnect layout optimization," in ICCAD, Nov. 1996.
[3] J. Lillis, C. Cheng, and T. Lin, "Simultaneous routing and buffer insertion for high performance interconnect," in GLVLSI Symp., 1996.
[4] C. Alpert, G. Gandham, J. Hu, J. Neves, S. Quay, and S. Sapatnekar, "Steiner tree optimization for buffers, blockages and bays," in ISCAS, May 2001.
[5] J. Hu, C. Alpert, S. Quay, and G. Gandham, "Buffer insertion with adaptive blockage avoidance," TCAD, vol. 22, no. 4, pp. 492-498, 2003.
[6] J. Cong and X. Yuan, "Routing tree construction under fixed buffer locations," in $D A C$, Jun 2000.
[7] W. Chen, M. Pedram, and P. Buch, "Buffered routing tree construction under buffer placement blockages," in $A S P-D A C$, Jan 2002.
[8] F. Li, Y. Lin, and L. He, "Vdd programmability to reduce fpga interconnect power," in ICCAD, Nov 2004.
[9] H. Bakoglu, Circuits, Interconnects and Packaging for VLSI. Addison-Wesley, 1990.
[10] K. Banerjee and A. Mehrotra, "A power-optimal repeater insertion methodology for global interconnects in nanometer designs," TCAD, vol. 49, no. 11, pp. 2001-2007, 2002.
[11] Semiconductor Industry Association, International Technology Roadmap for Semiconductors, 2003.
[12] "Quickcap user manual," in http://www.magma-da.com/.
[13] "Berkeley predictive technology model," in http://www-device.eecs.berkeley.edu/ptm.
[14] C. Alpert, D. Devgan, and C. Kashyap, "RC delay metrics for performance optimization," TCAD, vol. 20, no. 5, pp. 571-582, 2001.
[15] D. Warme, P. Winter, and M. Zachariasen, "Geosteiner," in http://www.diku.dk/geosteiner, 2003.
[16] W. Shi and C. Su, "The rectilinear steiner arborescence problem is NP-complete," in ACM-SIAM Symp. on Discrete Algorithms, 2000.

