# Temperature and Supply Voltage Aware Performance and Power Modeling at Microarchitecture Level

Weiping Liao, Student Member, IEEE, Lei He, Member, IEEE, and Kevin M. Lepak, Member, IEEE

Abstract—Performance and power are two primary design issues for systems ranging from server computers to handhelds. Performance is affected by both temperature and supply voltage because of the temperature and voltage dependence of circuit delay. Furthermore, as semiconductor technology scales down, leakage power's exponential dependence on temperature and supply voltage becomes significant. Therefore, future design studies call for temperature and voltage aware performance and power modeling. In this paper, we study microarchitecture-level temperature and voltage aware performance and power modeling. We present a leakage power model with temperature and voltage scaling, and show that leakage and total energy vary by 38% and 24%, respectively, between 65°C and 110°C. We study thermal runaway induced by the interdependence between temperature and leakage power, and demonstrate that without temperature-aware modeling, underestimation of leakage power may lead to the failure of thermal controls, and overestimation of leakage power may result in excessive performance penalties of up to 5.24%. All of these studies underscore the necessity of temperature-aware power modeling. Furthermore, we study optimal voltage scaling for best performance with dynamic power and thermal management under different packaging options. We show that dynamic power and thermal management allows designs to target the common-case thermal scenario among benchmarks and improves performance by 6.59% compared to designs that target the worst case thermal scenario without dynamic power and thermal management. Additionally, we show that the optimal $V_{dd}$ for the best performance may not be the largest $V_{dd}$ allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly.

*Index Terms*—Floorplan, leakage power, microarchitecture, temperature, thermal management.

## I. INTRODUCTION

System performance and power consumption are two primary design issues for systems ranging from server computers to handhelds. System performance is affected by both temperature and supply voltage $V_{dd}$ scaling because circuit delay and the maximum system clock frequency depend on both temperature and $V_{dd}$ [1]. In addition to system performance, within the last ten years, power has become another primary

W. Liao and L. He are with the Electrical Engineering Department, University of California, Los Angeles, CA 90095 USA (e-mail: wliao@ee.ucla.edu; lhe@ee.ucla.edu).

K. M. Lepak is with Advanced Micro Devices, Inc., Austin, TX 78741 USA (e-mail: kevin.lepak@amd.com).

Digital Object Identifier 10.1109/TCAD.2005.850860

design concern [2]. For VLSI circuits, power consumption includes dynamic power and leakage power, both of which strongly depend on $V_{dd}$. Furthermore, as semiconductor technology keeps scaling down, leakage power grows significantly at the system level because of: 1) the increase of device leakage current due to the reduction in threshold voltage, channel length, and gate oxide thickness [3]; and 2) the increasing number of idle modules in a highly integrated system. For current high-performance design methodologies, the contribution of leakage power increases at each technology generation [4]. The Intel Pentium IV processors running at 3 GHz already have an almost equal amount of leakage and dynamic power [5]. As leakage power becomes important, due to its dependence on temperature, temperature-aware leakage power modeling and dynamic coupled power/thermal management (DPTM) become necessary for accurate power estimation and appropriate power/thermal management.

Most existing microarchitecture-level cycle-accurate simulators fail to take into account the temperature and voltage dependence of either performance or power. On the one hand, existing performance simulators [6], [7] use instructions per cycle (IPC) to represent performance and do not consider possible changes in clock frequency with different $V_{dd}$ and thermal envelopes. This approach is no longer valid with $V_{dd}$ scaling, considering power/thermal envelopes. A temperature-dependent circuit delay model has been developed [1] which may remedy this deficiency in existing microarchitecture simulators. However, there are no existing microarchitecture-level studies considering the impact of temperature-dependent circuit delay. Furthermore, the impact of leakage power on temperature is not considered during performance evaluation.

On the other hand, existing power simulators [8]–[10] calculate leakage power by assuming a fixed ratio between dynamic and leakage power. This assumption is not accurate because dynamic power and leakage power scale differently as a function of  $V_{dd}$  and temperature. Furthermore, leakage power is sensitive to temperature while dynamic power is independent of temperature.

High-level leakage power modeling has been studied. Refs. [11]–[14] all present high-level leakage power models without temperature scaling. Therefore, none of these models is sufficient to study microarchitecture-level power and temperature interaction. Microarchitecture-level thermal modeling has also been studied. Ref. [15] models the on-chip temperature as the average power consumption within a fixed time window. Ref. [16] proposes a simple thermal calculation, applying a one-segment lumped thermal resistance and capacitance circuit to model the entire chip and package. This is extended to model

Manuscript received December 3, 2003; revised May 29, 2004, and September 29, 2004. This paper was supported in part by the National Science Foundation (NSF) CAREER Award CCR-0401682, Semiconductor Research Corporation (SRC) Grant 1116, a UC MICRO grant sponsored by Mindspeed, Fujitsu Laboratories of America, Intel, and a Faculty Partner Award by IBM. This paper was recommended by Associate Editor F. N. Najm.

each module by such a one-segment circuit in [17], where the temperature difference is calculated without horizontal heat transfer. HotSpot [18], [19] provides a detailed thermal model based on an equivalent distributed circuit of thermal resistances and capacitances that correspond to microarchitectural units and the package with heat spreader and heatsink. The thermal calculation in HotSpot considers three dimensional heat transfer. However, both temperature modeling and dynamic thermal management in HotSpot do not consider the temperature and voltage dependence of leakage power.

A limited number of studies consider the interdependence between power and temperature. Ref. [17] proposes a leakage power model with temperature scaling for 100-nm technology with an empirical temperature-dependent term $\exp((-a)/(T-b))$, where $a$ and $b$ are empirical constants and $T$ is the temperature. Voltage scaling is not considered for either dynamic or leakage power in [17]. Ref. [17] considers thermal calculation based on the whole chip and individual modules, but the thermal resistances of all modules are simply empirical. Ref. [20] proposes a thermal model similar to that in [16] and a leakage model with empirical exponential temperature scaling to study reducing power through activity migration. However, no coupled power and thermal management is studied in [20]. Furthermore, [20] does not consider voltage scaling in the power model.

In this paper, we present leakage power models with  $V_{dd}$  and temperature scaling based on the BSIM4 model for subthreshold and gate leakage current,<sup>1</sup> and develop a coupled thermal and power microarchitecture simulator PTscalar [22] which considers the interdependence between leakage and temperature. With PTscalar, we are able to explore various microarchitecture-level leakage power and thermal models as well as coupled power/thermal simulation and management considering the interdependence between leakage power and temperature. We show the dramatic dependence of leakage power on temperature at the microarchitecture level within the temperature range between 65 °C and 110 °C. We also discuss thermal runaway induced by the interdependence of leakage and temperature. We further demonstrate that for dynamic thermal management, underestimating the temperature dependence of leakage leads to violations of temperature constraints and overestimating the temperature dependence of leakage leads to up to 5.24% performance loss due to over-aggressive application of power reduction techniques. These studies underscore the need for temperature-aware power modeling and DPTM.

Furthermore, we present studies on optimal voltage scaling for best performance with DPTM considering voltage scaling. We show that DPTM can increase maximum system throughput by 6.59% compared to designs targeting worst case thermal scenarios without DPTM. Contrary to the widely-accepted belief that scaling to larger  $V_{dd}$  leads to improved performance (through gains in clock frequency), we show that the optimal  $V_{dd}$  for the best performance may not be the largest  $V_{dd}$  allowed by the given package platform. We also study the impact of active cooling techniques providing smaller thermal resistance

<sup>1</sup>In essence, a similar leakage model based on BSIM3 was developed by an independent study [21].

and show that such techniques can improve maximum system throughput by 15.1% compared to conventional air cooling. All these studies indicate the necessity of temperature-aware performance modeling.

The rest of this paper is organized as follows. In Section II, we develop power and delay models with both voltage and temperature scaling. In Section III, we introduce our thermal model, study microarchitectural-level coupled power and thermal simulation, and discuss the thermal runaway induced by the interdependence between leakage and temperature. In Section IV, we study the importance of coupled power and thermal management. In Section V, we study optimal voltage scaling for the best performance with dynamic power and thermal management under different packaging options. We conclude in Section VI.

## II. POWER AND DELAY MODEL WITH TEMPERATURE AND VOLTAGE SCALING

## A. Power Model With Temperature and Voltage Scaling

We define three power states, as follows.

- 1) Active mode, where a circuit performs an operation and dissipates both dynamic power  $(P_d)$  and leakage power  $(P_s)$ . The sum of  $P_d$  and  $P_s$  is active power  $(P_a)$ .
- 2) Standby mode, where a circuit is idle but ready to execute an operation, and dissipates only leakage power  $(P_s)$ .
- 3) *Inactive mode*, where a circuit is deactivated by power gating [23] or other leakage reduction techniques, and dissipates a reduced leakage power defined as inactive power  $(P_i)$ . A circuit in the inactive mode requires a nonnegligible amount of time to wake up and then perform a useful operation [10].

Dynamic energy is consumed by charging and discharging capacitances. It is independent of temperature, but has a quadratic dependence on supply voltage. In our experiment, dynamic energy in each clock cycle is calculated as  $CV^2$ .

In the rest of this subsection, we discuss our leakage power model with $V_{dd}$ and temperature scaling. It has been shown in [24] that leakage power mainly consists of subthreshold and gate leakage power. Each type of leakage exhibits a different temperature and $V_{dd}$ dependence. More importantly, the two manifest themselves under different conditions, and the worst case leakage power is not the simple sum of the worst case subthreshold and gate leakage power.

1) Subthreshold Leakage Power Models: We study subthreshold leakage power modeling for two types of circuits: one is logic circuits such as functional units, the other is memory-based units such as caches and register files, modeled by SRAM arrays.

For logic circuits, we use the leakage power model proposed in [25]. As shown in (1), for a given circuit, the leakage power can be calculated as the product of the number of gates  $(N_{\text{gate}})$ and the average subthreshold leakage current per gate  $(I_{\text{avg}}^{\text{sub}})$ 

$$P_{\rm sub} = N_{\rm gate} \cdot I_{\rm avg}^{\rm sub} \cdot V_{dd}. \tag{1}$$

 $I_{\text{avg}}^{\text{sub}}$  can be calculated by computing the average leakage current per gate for the given *n* circuits using gate-level estimation. Because leakage current depends on different input vectors [11],



Fig. 1.  $I_{\rm avg}$  of random logic. The circuits are selected from MCNC'91 benchmark set [26] including circuits for ALU, control, multiplier, decoder, counter, etc.

we apply a genetic algorithm presented in [25] to obtain the input vectors for both maximum and minimum leakage currents. First, each candidate solution (an input vector) is encoded as a string whose length equals the number of primary inputs. The initial population is randomly generated. After that, each iteration follows these procedures:

- 1) evaluate the fitness value of each string;
- 2) apply tournament selection;
- 3) apply crossover and mutation schemes;
- 4) produce the new generation.

Finally, the algorithm stops after the number of generations exceeds a predefined number. We then calculate $I_{avg}^{sub}$ with the input vectors obtained by this algorithm. Fig. 1 shows this $I_{avg}^{sub}$ calculated with respect to the number of circuits. The circuits are selected from the MCNC'91 benchmark set [26] and include circuits for ALU, control, multiplier, decoder, counter, etc. Once the number of circuits exceeds 20, the value of $I_{avg}^{sub}$ becomes stable for both maximum and minimum leakage current when these circuits are designed using the same cell library. As Fig. 1 also shows, the average difference between maximum and minimum $I_{avg}^{sub}$ is about 60% of the minimum $I_{avg}^{sub}$.
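The four steps above can be sketched as follows. This is a minimal illustration, not the implementation of [25]: the function `genetic_search`, its parameters (`pop_size`, `p_cross`, `p_mut`), the single-point crossover, and the stand-in fitness function `toy_leakage` are all assumptions introduced here. A real flow would replace `toy_leakage` with a gate-level leakage estimate for the circuit under study.

```python
import random

def genetic_search(fitness, n_inputs, pop_size=20, generations=50,
                   p_cross=0.8, p_mut=0.02, seed=0):
    """Search for the bit string (input vector) that maximizes `fitness`."""
    rng = random.Random(seed)
    # Initial population: random bit strings, one bit per primary input.
    pop = [[rng.randint(0, 1) for _ in range(n_inputs)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # 1) Evaluate the fitness value of each string.
        scores = [fitness(ind) for ind in pop]

        # 2) Tournament selection: the fitter of two random strings wins.
        def select():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        # 3) Single-point crossover followed by bit-flip mutation.
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if n_inputs > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, n_inputs)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)

        # 4) The children form the new generation; remember the best string.
        pop = children
        best = max(pop + [best], key=fitness)
    return best


# Hypothetical stand-in for a gate-level leakage estimator (illustrative only):
# leakage grows with the number of inputs held at logic 0.
def toy_leakage(vector):
    return sum(1 for bit in vector if bit == 0)

max_vector = genetic_search(toy_leakage, n_inputs=8)
min_vector = genetic_search(lambda v: -toy_leakage(v), n_inputs=8)
```

Running the search twice, once on the fitness and once on its negation, mirrors the text's use of the algorithm for both the maximum and the minimum leakage vectors.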

A formula similar to (1) has been proposed in [13] which explicitly considers the statistical impacts of transistor stacking. However, no explicit method is proposed in [11] and [13] to consider voltage and temperature scaling. We characterize the temperature and voltage scaling of  $I_{\text{avg}}^{\text{sub}}$  based on the following BSIM4 subthreshold leakage current model [4]:

$$I_{\rm sub} = A e^{\frac{\left(V_{\rm GS} - V_T - \gamma' V_{\rm SB} + \eta V_{\rm DS}\right)}{n V_{\rm TH}}} \left(1 - e^{-\frac{V_{\rm DS}}{V_{\rm TH}}}\right) \tag{2}$$

$$A = \mu_0 C_{\rm ox} \frac{W}{L_{\rm eff}} V_{\rm TH}^2 e^{1.8} \tag{3}$$

where  $V_{\rm GS}$ ,  $V_{\rm DS}$ , and  $V_{\rm SB}$  are the gate-source, drain-source, and source-bulk voltages, respectively,  $V_T$  is the zero-bias threshold voltage,  $V_{\rm TH}$  is the thermal voltage kT/q,  $\gamma'$  is the linearized body-effect coefficient,  $\eta$  is the drain induced barrier lowering

TABLE I EMPIRICAL CONSTANTS IN (17) AND (18) FOR 65-nm TECHNOLOGY. THESE CONSTANTS ARE THE SAME FOR CASES WITH AND WITHOUT POWER GATING

| X       | Y        | Z   |
|---------|----------|-----|
| 0.20306 | -0.25289 | 1.0 |

(DIBL) coefficient,  $\mu_0$  is the carrier mobility,  $C_{\text{ox}}$  is gate capacitance per area, W is the width and  $L_{\text{eff}}$  is the effective gate length.

From (2) we can see that the temperature scaling for subthreshold leakage current is $T^2e^{1/T}$, where $T$ is the temperature, and the voltage scaling for leakage current is $e^{V_{dd}}$. Based on these observations, we propose the following formula for $I_{\text{avg}}^{\text{sub}}$ considering temperature and voltage scaling:

$$I_{\text{avg}}^{\text{sub}}(T, V_{dd}) = I_s^{\text{sub}}(T_0, V_0) \cdot T^2 \cdot e^{\left(\frac{\alpha_{s1} \cdot V_{dd} + \beta_{s1}}{T}\right)} \tag{4}$$

where  $I_s^{\text{sub}}$  is a constant current at the reference temperature  $T_0$  and voltage  $V_0$ .  $\alpha_{s1}$  and  $\beta_{s1}$  in (4) are empirical constants decided by circuit designs.
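As a worked consequence of (4), the prefactor $I_s^{\text{sub}}(T_0, V_0)$ cancels when the same circuit is compared at two temperatures and a fixed supply voltage. The symbols $T_1$ and $T_2$ are introduced here only for illustration:

$$\frac{I_{\text{avg}}^{\text{sub}}(T_2, V_{dd})}{I_{\text{avg}}^{\text{sub}}(T_1, V_{dd})} = \left(\frac{T_2}{T_1}\right)^2 \cdot e^{\left(\alpha_{s1} \cdot V_{dd} + \beta_{s1}\right)\left(\frac{1}{T_2} - \frac{1}{T_1}\right)}$$

If $\alpha_{s1} \cdot V_{dd} + \beta_{s1}$ is negative, then for $T_2 > T_1$ both factors exceed one, so the subthreshold current grows faster than quadratically with temperature.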

Memory-based units such as caches and register files are usually modeled by SRAM arrays. A formula-based subthreshold leakage power model without temperature and voltage scaling has been proposed in [10]. We use a similar model in this work:

$$P_{\rm sub} = P_{\rm ckts}^{\rm sub} + P_{\rm cells}^{\rm sub} \tag{5}$$

$$P_{\rm ckts}^{\rm sub}(T, V_{dd}) = (X \cdot \text{words} \cdot \text{word\_size} + Y \cdot \text{word\_size}) \cdot V_{dd} \cdot T^2 \cdot e^{(\alpha_{s2} \cdot V_{dd} + \beta_{s2})/T} \tag{6}$$

$$P_{\text{cells}}^{\text{sub}}(T, V_{dd}) = (Z \cdot \text{words} \cdot \text{word\_size}) \cdot V_{dd} \cdot T^2 \cdot e^{(\alpha_{s3} \cdot V_{dd} + \beta_{s3})/T} \tag{7}$$

where $P_{\text{cells}}^{\text{sub}}$ is the subthreshold leakage power dissipated by the SRAM memory cells and is proportional to the number of cells, and $P_{\text{ckts}}^{\text{sub}}$ is the power dissipated by accompanying circuits such as wordline drivers and precharge transistors. $P_{\text{cells}}^{\text{sub}}$ and $P_{\text{ckts}}^{\text{sub}}$ essentially have the same format as (1), where $X \cdot \text{words} \cdot \text{word\_size} + Y \cdot \text{word\_size}$ in (6) and $Z \cdot \text{words} \cdot \text{word\_size}$ in (7) can be viewed as $N_{\text{gate}}$. $X$, $Y$, $Z$, $\alpha_{s2}$, $\alpha_{s3}$, $\beta_{s2}$, and $\beta_{s3}$ in (6) and (7) are empirical constants decided by circuit designs.

2) Gate Leakage: In the BSIM4 gate leakage model [27], gate leakage current is calculated as gate direct tunneling current, which includes the tunneling current between gate and substrate $(I_{gb})$ and the current between gate and channel $(I_{gc})$. The formulas for $I_{gb}$ and $I_{gc}$ are

$$I_{gb} = W_{\text{eff}} \cdot L_{\text{eff}} \cdot X_1 \cdot \left( EXP_{\text{acc}} + EXP_{\text{inv}} \right) \tag{8}$$

$$I_{gc} = W_{\text{eff}} \cdot L_{\text{eff}} \cdot X_2 \cdot e^{(-B_3 \cdot T_{\rm ox} \cdot (\alpha_3 - \beta_3 \cdot V_{\rm oxdepinv}) \cdot (1 + \gamma_3 \cdot V_{\rm oxdepinv}))} \tag{9}$$

where

$$X_1 = A_1 \cdot T_{\text{oxRatio}} \cdot V_{gb} \cdot V_{\text{aux}} \tag{10}$$

$$EXP_{\rm acc} = e^{(-B_1 \cdot T_{\rm ox} \cdot (\alpha_1 - \beta_1 \cdot V_{\rm oxacc}) \cdot (1 + \gamma_1 \cdot V_{\rm oxacc}))} \tag{11}$$

$$EXP_{\rm inv} = e^{(-B_2 \cdot T_{\rm ox} \cdot (\alpha_2 - \beta_2 \cdot V_{\rm oxdepinv}) \cdot (1 + \gamma_2 \cdot V_{\rm oxdepinv}))} \tag{12}$$

$$X_2 = A_2 \cdot T_{\text{oxRatio}} \cdot V_{\text{gse}} \cdot V_{\text{aux}}. \tag{13}$$

 $A_1, A_2, B_1, B_2, B_3, \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \beta_3, \gamma_1, \gamma_2$  and  $\gamma_3$  are all empirical constants given by BSIM4 gate leakage model,

TABLE II COEFFICIENTS FOR THE SCALING FUNCTION IN (19) FOR DIFFERENT CIRCUITS IN 65-nm TECHNOLOGY

|             | A          | B          | $\alpha$ | $\beta$     | $\gamma$ | $\delta$ |
|-------------|------------|------------|----------|-------------|----------|----------|
| $I_{avg}$   | 1.1432e-12 | 1.0126e-14 | 466.4029 | -1224.74083 | 6.28153  | 6.9094   |
| $P_{ckts}$  | 1.1432e-12 | 1.3906e-13 | 466.4029 | -1224.74083 | 6.6943   | 4.46958  |
| $P_{cells}$ | 2e-12      | 2.0581e-13 | 930.1355 | -1712.5319  | 6.6943   | 4.46958  |

TABLE III COMPARISON BETWEEN OUR FORMULA AND SPICE SIMULATION. $I_{avg}$ IS FOR LOGIC CIRCUITS. $P_{so}$ IS STANDBY POWER FOR THE SRAM POWER MODEL. THE SRAM ARRAYS ARE REPRESENTED AS "ROW NUMBER" X "COLUMN NUMBER". THE UNITS FOR $I_{avg}$ AND SRAM POWER ARE uA AND uW, RESPECTIVELY

| Circuit                                           | Temperature (°C) | $V_{dd}$ | Formula | SPICE  | Abs. err. (%) |
|---------------------------------------------------|------------------|----------|---------|--------|---------------|
| $I_{avg}$ (uA)                                    |                  |          |         |        |               |
| Logic circuits for adder, multiplier, and shifter | 100              | 0.95     | 23.44   | 23.56  | 0.49          |
|                                                   | 100              | 1.05     | 29.56   | 29.63  | 0.23          |
|                                                   | 80               | 0.95     | 19.44   | 19.54  | 0.56          |
|                                                   | 80               | 1.05     | 25.14   | 25.21  | 0.27          |
|                                                   | 60               | 0.95     | 16.00   | 16.11  | 0.65          |
|                                                   | 60               | 1.05     | 21.33   | 21.39  | 0.31          |
| $P_{so}$ (uW)                                     |                  |          |         |        |               |
| SRAM 128x32                                       | 100              | 0.95     | 181.91  | 188.18 | 3.54          |
|                                                   | 100              | 1.05     | 262.71  | 271.42 | 3.31          |
| SRAM 512x32                                       | 100              | 0.95     | 729.11  | 753.38 | 3.33          |
|                                                   | 100              | 1.05     | 1052.8  | 1086.5 | 3.21          |


$W_{\text{eff}}$ and $L_{\text{eff}}$ are the channel width and length, respectively; and $T_{\text{oxRatio}}$ and $V_{\text{aux}}$ are defined in the BSIM4 gate leakage model.

From (8) and (9), we can see that in contrast to subthreshold leakage, gate leakage is insensitive to temperature. However, gate leakage is dependent on  $V_{dd}$  in the form of  $e^{V_{dd}}$ .

3) Total Leakage Power: To combine subthreshold leakage and gate leakage, we keep the format of the formulas in our subthreshold leakage power model, as in (1) and (5)–(7), but take into account the different scaling features of subthreshold leakage and gate leakage. With this framework in place, we consider both subthreshold and gate leakage power for logic circuits and memory-based units, as shown in (14)–(18)

$$P_{s\_\log} = N_{\text{gate}} \cdot I_{\text{avg}} \cdot V_{dd} \tag{14}$$

$$I_{\text{avg}}(T, V_{dd}) = I_s(T_0, V_0) \cdot f_{\text{avg}}(T, V_{dd}) \tag{15}$$

$$P_{s\_\text{mem}} = P_{\text{ckts}} + P_{\text{cells}} \tag{16}$$

$$P_{\text{ckts}}(T, V_{dd}) = (X \cdot \text{words} \cdot \text{word\_size} + Y \cdot \text{word\_size}) \cdot V_{dd} \cdot f_{\text{ckts}}(T, V_{dd}) \tag{17}$$

$$P_{\text{cells}}(T, V_{dd}) = (Z \cdot \text{words} \cdot \text{word\_size}) \cdot V_{dd} \cdot f_{\text{cells}}(T, V_{dd}) \tag{18}$$

where $P_{s\_\log}$ is the total leakage power for logic circuits, $I_{avg}$ is the total leakage current per gate, $I_s$ is the $I_{avg}$ at given temperature $T_0$ and supply voltage $V_0$, $P_{s\_mem}$ is the total leakage power for memory-based units, $P_{ckts}$ and $P_{cells}$ are the total leakage power for accompanying circuits and SRAM cells, respectively, and $f_{avg}(T, V_{dd})$, $f_{ckts}(T, V_{dd})$, and $f_{cells}(T, V_{dd})$ are scaling functions that characterize temperature and $V_{dd}$ scaling considering both subthreshold and gate leakage. All three scaling functions $f_{avg}$, $f_{ckts}$, and $f_{cells}$ have the same format as (19)

$$f(T, V_{dd}) = A \cdot T^2 \cdot e^{((\alpha \cdot V_{dd} + \beta)/T)} + B \cdot e^{(\gamma \cdot V_{dd} + \delta)} \tag{19}$$

where $A$, $B$, $\alpha$, $\beta$, $\gamma$, and $\delta$ are empirical constants for different circuit types, technologies, and designs. Notice that (19) contains one temperature-dependent scaling term for subthreshold leakage current and one temperature-independent scaling term for gate leakage current. Each empirical constant takes a different value for each scaling function. The values of $A$, $B$, $\alpha$, $\beta$, $\gamma$, and $\delta$, as well as the validation of our power model, are presented in Section II-A4.
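The structure of (14)–(19) can be sketched in a few lines of Python using the constants listed in Tables I and II. This is an illustration under stated assumptions rather than the PTscalar implementation: the function names are introduced here, the temperature is assumed to be in kelvin (the text does not restate the unit), and `i_s` is a placeholder for $I_s(T_0, V_0)$, whose value is not given in this section.

```python
import math

# Table I constants (65 nm) for the gate-count terms in (17) and (18).
X, Y, Z = 0.20306, -0.25289, 1.0

# Table II coefficients (A, B, alpha, beta, gamma, delta) for (19).
COEFFS = {
    "avg":   (1.1432e-12, 1.0126e-14, 466.4029, -1224.74083, 6.28153, 6.9094),
    "ckts":  (1.1432e-12, 1.3906e-13, 466.4029, -1224.74083, 6.6943, 4.46958),
    "cells": (2e-12,      2.0581e-13, 930.1355, -1712.5319,  6.6943, 4.46958),
}

def scaling(kind, temp_k, vdd):
    """Scaling function (19); temp_k is ASSUMED to be in kelvin."""
    a, b, alpha, beta, gamma, delta = COEFFS[kind]
    subthreshold = a * temp_k ** 2 * math.exp((alpha * vdd + beta) / temp_k)
    gate = b * math.exp(gamma * vdd + delta)  # temperature independent
    return subthreshold + gate

def sram_leakage(words, word_size, temp_k, vdd):
    """Total SRAM leakage per (16)-(18): accompanying circuits plus cells."""
    p_ckts = ((X * words * word_size + Y * word_size)
              * vdd * scaling("ckts", temp_k, vdd))
    p_cells = (Z * words * word_size) * vdd * scaling("cells", temp_k, vdd)
    return p_ckts + p_cells

def logic_leakage(n_gate, temp_k, vdd, i_s=1.0):
    """Logic leakage per (14)-(15); i_s is a placeholder for I_s(T0, V0)."""
    return n_gate * i_s * scaling("avg", temp_k, vdd) * vdd

# Example: a 128 x 32 SRAM array at 100 deg C and 60 deg C, V_dd = 0.95.
hot = sram_leakage(128, 32, 373.15, 0.95)
cool = sram_leakage(128, 32, 333.15, 0.95)
```

Keeping the subthreshold and gate terms separate inside `scaling` makes it easy to see that only the first term responds to temperature, while both respond to $V_{dd}$.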

4) Leakage Model Validation: We obtain the constants in (17)–(19) empirically by determining the power consumption for different circuit types at multiple temperatures using SPICE simulations and then applying curve fitting. In our experiments, we use the input vectors that maximize subthreshold leakage power for each type of circuit. We choose 65-nm technology. The design parameters for this technology are obtained from the Berkeley Predictive Technology Models [28]. For $I_{\text{avg}}$, we use the average leakage current for three types of circuits with different bit-widths: adder (4-bit, 16-bit, and 32-bit), shifter (8-bit, 16-bit, and 32-bit), and multiplier (4-bit, 5-bit, and 6-bit). We provide a gate-level netlist for each type of circuit for simulation. For SRAM arrays, we use different combinations of rows and columns. Different temperatures are chosen during curve fitting and verification. Tables I and II summarize the empirical constants. Table III compares our high-level leakage power estimation for logic circuits and SRAM arrays with SPICE simulations in 65-nm technology. As shown in Table III, the error for logic circuits is less than 1%. For the SRAM arrays, our leakage model achieves similarly small errors (less than 1%) for the SRAM cells $P_{cells}$. However, the power estimation error for the accompanying circuits $P_{\text{ckts}}$ is large (up to 30%). Therefore, the final error becomes 3.5% when the two parts are added to give the total leakage power $P_{s\_mem}$. This error margin is acceptable for the study in this paper, and a more detailed model of $P_{\rm ckts}$ is not developed here. Overall, the difference between our formulas and SPICE simulation is less than 4%, indicating that the formulas for high-level leakage power estimation achieve reasonable accuracy.

TABLE IV COMPARISON BETWEEN OUR FORMULA AND SPICE SIMULATION FOR CIRCUIT DELAY IN PS OF AN INVERTER WITH FO-4 LOAD

| $T$ (°C)   | $V_{dd}$          | SPICE                   | Formula                 | Error (%)            |
|------------|-------------------|-------------------------|-------------------------|----------------------|
| 60         | 0.9               | 31.17                   | 33.53                   | 7.57                 |
| 60         | 1.1               | 28.42                   | 30.08                   | 5.85                 |
| 80         | 0.9               | 38.31                   | 35.94                   | 6.17                 |
| 80         | 1.1               | 30.65                   | 32.24                   | 5.19                 |
| 100        | 0.9               | 40.27                   | 38.38                   | 4.71                 |
| 100        | 1.1               | 32.94                   | 34.42                   | 4.51                 |

## B. Delay Model With Voltage and Temperature Scaling

For VLSI circuits, the relationship between circuit delay and supply voltage  $V_{dd}$  is delay  $\propto V_{dd}/(V_{dd} - V_t)^{\xi}$ , where  $V_t$  is the threshold voltage and  $\xi$  is an empirical constant. Temperature also affects circuit delay by affecting carrier mobility and threshold voltage [29]. The delay model with temperature and voltage scaling is

$$\text{delay} \propto \frac{V_{dd}T^{\mu}}{(V_{dd} - V_t)^{\xi}} \tag{20}$$

where $\mu$ and $\xi$ are empirical constants that depend on the technology. We obtain $\mu = 1.19$ and $\xi = 1.2$ for 65-nm technology empirically, by SPICE simulation and curve fitting. Table IV compares our delay model with SPICE simulation for the delay of an inverter with an FO-4 load, where we use the formula delay = $2.3351 \times 10^{-15} (V_{dd} T^{1.19} / (V_{dd} - V_t)^{1.2})$.<sup>2</sup> The absolute error is within 8%.

By assuming the maximum clock frequency  $f_{\text{max}} = 1/\text{delay}$ , the appropriate supply voltage to achieve  $f_{\text{max}}$  can be decided by

$$f_{\rm max} \propto \frac{(V_{dd} - V_t)^{1.2}}{V_{dd}T^{1.19}}. \tag{21}$$
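Because (21) is a proportionality, it is most naturally used to compare two operating points. The sketch below does this under stated assumptions: the function name `relative_fmax` is introduced here, the threshold voltage `vt=0.2` is a hypothetical placeholder (the text does not give $V_t$), and the temperature is assumed to be in kelvin.

```python
def relative_fmax(temp_k, vdd, vt=0.2, mu=1.19, xi=1.2):
    """Unnormalized f_max from (21); only ratios between calls are meaningful.

    vt is a HYPOTHETICAL threshold voltage; temp_k is assumed to be in kelvin.
    """
    if vdd <= vt:
        raise ValueError("vdd must exceed the threshold voltage")
    return (vdd - vt) ** xi / (vdd * temp_k ** mu)

# Reference operating point: 100 deg C at V_dd = 1.0.
ref = relative_fmax(373.15, 1.0)

# Frequency gain from running cooler (65 deg C) at the same V_dd.
speedup_cooler = relative_fmax(338.15, 1.0) / ref

# Frequency gain from raising V_dd to 1.1 at the same temperature.
speedup_higher_vdd = relative_fmax(373.15, 1.1) / ref
```

At a fixed $V_{dd}$ the voltage terms cancel, so `speedup_cooler` reduces to $(T_{\text{ref}}/T)^{1.19}$ regardless of the assumed `vt`.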

## III. COUPLED POWER AND THERMAL SIMULATION

## A. Thermal Model

According to the well-known duality between heat transfer and electrical phenomena [30], temperature can be modeled by equivalent RC thermal circuits, in which two parameters, thermal resistance $R_t$ and thermal capacitance $C_t$, characterize thermal behavior. We develop our thermal calculation based on the equivalent RC thermal circuits presented in the HotSpot toolset [19]. As shown in Fig. 2 from [19], the equivalent RC thermal circuit consists of three layers: heat sink, heat spreader, and chip die. The chip die is partitioned into functional blocks according to microarchitecture functionality. The heat spreader is divided into five blocks: one for the area right under the die and four trapezoids for the periphery not covered by the die. Similar to the heat spreader, the heat sink is divided into five blocks. For each block, there are two types of RC pairs to capture both vertical and horizontal heat transfer characteristics. The vertical RC pairs connect the center of each block down to the center of the next layer, to model the vertical



Fig. 2. Side view of IC package [19].

heat transfer between layers. The lateral RC pairs connect the center of each block to the center of the cross section between this block and adjacent blocks in the same layer. The lateral RC pairs characterize the horizontal heat transfer between blocks within each layer. For each RC pair, the thermal resistance  $R_t$  is proportional to the thickness of the block and inversely proportional to the cross-sectional area across which the heat is being transferred. In contrast, the thermal capacitance  $C_t$  is directly proportional to both thickness and area. Provided the average power within a time period, the transient temperature is calculated by solving the differential equations for the RC circuit with a fourth-order Runge–Kutta method [19].
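To make the integration step concrete, the sketch below applies a classical fourth-order Runge–Kutta update to a single lumped RC node. It is a deliberate simplification, not the multi-block HotSpot network described above: one block, one $R_t$, one $C_t$, and constant power. The function name and all numerical values are hypothetical and chosen only for illustration.

```python
def rk4_step(temp, power, dt, r_t, c_t, t_amb):
    """One classical RK4 step of C_t * dT/dt = P - (T - T_amb) / R_t."""
    def dT_dt(t):
        return (power - (t - t_amb) / r_t) / c_t

    k1 = dT_dt(temp)
    k2 = dT_dt(temp + 0.5 * dt * k1)
    k3 = dT_dt(temp + 0.5 * dt * k2)
    k4 = dT_dt(temp + dt * k3)
    return temp + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


# HYPOTHETICAL single-block values: R_t in K/W, C_t in J/K, ambient in deg C.
R_T, C_T, T_AMB = 2.0, 0.005, 45.0   # time constant R_t * C_t = 10 ms
POWER = 10.0                         # average power over the period, in W

temp = T_AMB
for _ in range(1000):                # 1000 steps of 0.1 ms = 10 time constants
    temp = rk4_step(temp, POWER, dt=1e-4, r_t=R_T, c_t=C_T, t_amb=T_AMB)

steady_state = T_AMB + POWER * R_T   # analytic limit of the single-node model
```

After ten time constants the integrated temperature sits just below the analytic steady state $T_{\text{amb}} + P \cdot R_t$, which is a convenient sanity check for any such integrator.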

The thermal time constants ($\tau = R_t \cdot C_t$) for blocks are usually on the order of milliseconds, which is millions of times longer than a clock cycle. Therefore, it is not necessary to update temperature and power for every clock cycle. During simulation, we update temperature and power after each time step $t_s$. An appropriate value of $t_s$ can greatly reduce simulation overhead while maintaining accurate temperature calculation. Details of selecting $t_s$ are given in Section III-C.
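For a sense of scale, assume a hypothetical thermal time constant of 1 ms and a clock frequency $f_{\text{clk}}$ of 3 GHz (both are illustrative values, not parameters of the experiments):

$$\tau \cdot f_{\text{clk}} = 10^{-3}\,\text{s} \times 3 \times 10^{9}\,\text{s}^{-1} = 3 \times 10^{6}\ \text{cycles}$$

A time step $t_s$ spanning thousands of cycles would therefore still be short compared with the time over which the temperature changes appreciably.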

## B. Experiment Settings

We choose 65-nm technology [28] in our experiments. Although our power model is applicable to any instruction set architecture and microarchitecture, we study out-of-order superscalar architectures in this paper. We integrate our power model and temperature calculation into the SimpleScalar 3.00b toolset [6] with Alpha ISA<sup>3</sup> and name the new coupled power and thermal simulator PTscalar. Table V presents the microarchitectural processor configuration. We partition the microprocessor for power/thermal modeling by major functional components. As shown in Table VI, there are two types of components: memory-based units and logic circuits. When calculating the power of memory-based units, we first partition the component into SRAM sub-arrays with the CACTI 3.0 toolset [31], then apply our formulas for power consumption to each SRAM sub-array. The total component power consumption is the sum of the power for all SRAM sub-arrays. Among logic circuits, for integer ALUs and FPUs, we take the area in the design of the Alpha 21264 processor in 350-nm technology [32] and scale it down to 65-nm technology by assuming the area is proportional to the square of the feature size. For all other logic circuits, we

<sup>&</sup>lt;sup>2</sup>Note that the constant is only for the inverter delay presented Table IV and not used elsewhere. What we really focus on is the voltage and temperature scaling relationship for circuit delay.

<sup>&</sup>lt;sup>3</sup>Note that our leakage power and delay models with temperature and voltage scaling are independent of processor architecture and microarchitecture simulators. Instead of focusing on a specific architecture or processor design, our studies try to present the importance of temperature and voltage aware modeling, and discover the trend for future designs.

TABLE V SIMULATED MICROPROCESSOR CONFIGURATION

| Parameter            | Value                                                                                              |
|----------------------|----------------------------------------------------------------------------------------------------|
| Processor Core       |                                                                                                    |
| RUU size             | 64 instructions                                                                                    |
| LSQ size             | 32 instructions                                                                                    |
| Fetch Queue size     | 8 instructions                                                                                     |
| Fetch width          | 4 instructions/cycle                                                                               |
| Decode width         | 4 instructions/cycle                                                                               |
| Issue width          | 4 instructions/cycle                                                                               |
| Commit width         | 4 instructions/cycle                                                                               |
| Functional Units     | 3 integer addition, 1 integer multiplication/division, 1 FP addition, 1 FP multiplication/division |
| Branch Predictor     | Combined, Bimodal 4K table, 2-Level 1K table, 10-bit history, 4K chooser                           |
| BTB                  | 512 entries, 4-way                                                                                 |
| Memory Hierarchy     |                                                                                                    |
| L1 instruction-cache | 64KB, 4-way (LRU), 32B blocks, 1-cycle latency                                                     |
| L1 data-cache        | 64KB, 4-way (LRU), 32B blocks, 1-cycle latency                                                     |
| L2                   | Unified, 4MB, 8-way (LRU), 128B blocks, 12-cycle latency                                           |
| TLB                  | 128 entry, fully associative, 30-cycle miss latency                                                |
| Main memory          | 255-cycle latency                                                                                  |

TABLE VI COMPONENTS IN OUR EXPERIMENTS

| Component type     | Microarchitecture structure                                                                                              |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|
| Memory-based units | Caches, register files, TLB, branch predictor, register update unit (RUU), load/store queue (LSQ), rename table (RAT)    |
| Logic circuits     | Integer and floating-point functional units                                                                              |

TABLE VII Power in mW for all Components for 65-nm Technology. The Supply Voltage is 0.9 V, the Clock Frequency is 5 GHz, and the Temperature is 100 °C. The Decode, Integer ALU, and FPU Entries are Each for One Unit out of a Total of Four, Four, and Two Units, Respectively

| Component                    | $P_a$    | $P_s$    | $P_i$  |
|------------------------------|----------|----------|--------|
| BTB                          | 639.41   | 87.39    | 18.93  |
| L1 Instruction Cache         | 770.16   | 222.55   | 8.90   |
| L1 Data Cache                | 732.09   | 222.60   | 8.90   |
| Unified L2 Cache             | 20580.31 | 13123.87 | 524.95 |
| Integer Register File        | 56.20    | 1.57     | 0.06   |
| Floating-point Register File | 56.20    | 1.57     | 0.06   |
| RUU                          | 66.49    | 3.48     | 0.15   |
| LSQ                          | 112.40   | 3.14     | 0.19   |
| One Decode Unit              | 30.38    | 1.60     | 0.06   |
| One Integer ALU              | 554.60   | 11.46    | 0.11   |
| One Floating-point Unit      | 1122.45  | 21.57    | 0.22   |

estimate gate count according to the designs in [33], and then apply formula (1) to calculate the leakage power for logic circuits. Table VII summarizes the power consumption for all components in our system. Similar to other microarchitecture-level power simulators [9], [19], we do not consider the control logic as one component. The floorplan<sup>4</sup> we choose is shown in Fig. 3. The thermal model extracts the thermal resistance  $R_t$  and thermal capacitance  $C_t$  according to this floorplan. To consider appropriate supply voltage scaling for varying clock frequencies, we assume that  $V_{dd} = 0.9$  V obtains  $f_{max} = 5$  GHz as specified by the ITRS [34]. According to (21), the  $f_{max}$  for different  $V_{dd}$  and maximum temperature T allowed for the circuits in our experiments are shown in Table VIII.
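The area scaling assumed above for the integer ALUs and FPUs (area proportional to the square of the feature size) can be illustrated with a short calculation. This is only a sketch of that stated assumption; the function name and the unit source area are hypothetical, and no actual Alpha 21264 block areas are used.

```python
def scale_area(area_src, feature_src_nm, feature_dst_nm):
    """Scale a block area between technology nodes.

    Assumes area is proportional to the square of the feature size,
    as stated in the text. Units of the result match area_src.
    """
    return area_src * (feature_dst_nm / feature_src_nm) ** 2


# Shrink factor from 350-nm to 65-nm technology for a unit source area.
shrink_factor = scale_area(1.0, 350.0, 65.0)
print(f"area shrink factor 350 nm -> 65 nm: {shrink_factor:.4f}")
```

Under this assumption a block keeps roughly 3.4% of its 350-nm area at 65 nm.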

## C. Speedup of Coupled Power and Thermal Simulation

We update temperatures after each time step $t_s$, and then update the power values with respect to the newly calculated temperatures. A smaller $t_s$ gives a more accurate transient temperature analysis (e.g., $t_s = 1$ cycle corresponds to cycle-accurate temperature calculation). Fig. 4 plots the transient temperature of the BTB calculated using different $t_s$, expressed as percentages of the thermal time constant, where 0.5% of the thermal time constant equals 50 000 clock cycles at a 5-GHz clock frequency. When $t_s \leq 50\,000$ cycles (i.e., 0.5% of the thermal time constant), the temperatures are identical to those with $t_s = 1$ cycle. An observable difference appears when $t_s$ is increased to 5% of the thermal time constant, and significant error is induced when $t_s$ is 25% of the thermal time constant. Furthermore, Table IX compares the simulation time with temperature calculation to that of a simulation without temperature calculation. By setting $t_s$ to 50 000 cycles, we not only introduce negligible error in the temperature calculation, but also reduce run time by more than 23 times compared to $t_s = 1$ cycle, and achieve virtually the same computational efficiency as power simulation without temperature calculation. Since the clock frequencies in our experiments are never lower than 5 GHz, 0.5% of the thermal time constant always contains at least 50 000 cycles. We therefore use $t_s = 50\,000$ cycles throughout the rest of the paper.
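The effect of the time step can be reproduced qualitatively with a single-node RC model. The sketch below is not the PTscalar implementation: it assumes a forward-Euler update of $C_t\,dT/dt = P(T) - (T - T_a)/R_t$, a constant 10-W power, and hypothetical values of $R_t$, $C_t$, and $T_a$ chosen only so that the time constant is 2 ms as in Fig. 4.

```python
import math


def cycles_per_step(fraction_of_tau, tau_s, clock_hz):
    """Clock cycles in a time step given as a fraction of the thermal time constant."""
    return round(fraction_of_tau * tau_s * clock_hz)


def simulate_temperature(power_fn, r_t, c_t, t_amb, t_end_s, dt_s):
    """Forward-Euler integration; power is refreshed once per step from the latest temperature."""
    temp = t_amb
    for _ in range(round(t_end_s / dt_s)):
        power = power_fn(temp)
        temp += dt_s * (power - (temp - t_amb) / r_t) / c_t
    return temp


def constant_power(temp):
    """Hypothetical 10-W load with no temperature dependence."""
    return 10.0


# Hypothetical single-block parameters: tau = R_t * C_t = 2 ms.
R_T, C_T, T_AMB = 2.0, 1.0e-3, 45.0
TAU = R_T * C_T

# Closed-form temperature after one time constant, for comparison.
exact = T_AMB + 10.0 * R_T * (1.0 - math.exp(-1.0))
fine = simulate_temperature(constant_power, R_T, C_T, T_AMB, TAU, 0.005 * TAU)
coarse = simulate_temperature(constant_power, R_T, C_T, T_AMB, TAU, 0.25 * TAU)

print("cycles per step at 0.5% of tau, 5 GHz:", cycles_per_step(0.005, TAU, 5e9))
print(f"error with t_s = 0.5% of tau: {abs(fine - exact):.3f} C")
print(f"error with t_s = 25% of tau:  {abs(coarse - exact):.3f} C")
```

In this toy setting the 0.5% step tracks the closed-form solution closely, while the 25% step is off by about a degree after one time constant, in line with the trend reported for Fig. 4.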

## D. Temperature Dependent Leakage Power

Fig. 5 shows the experimental results for total leakage power consumption at two different temperatures. From Fig. 5 we can see that changing the temperature from 65 °C to 110 °C changes the total leakage energy by 38%. Fig. 5 clearly shows that no study of leakage power is accurate unless the temperature dependence of leakage power is considered. Since leakage is a nontrivial component of total power at common temperatures, the temperature dependence of total power must, by extension, also be considered.

As an engineering approximation, one might consider assuming a fixed temperature appropriate for the processor and package, and then use leakage values at this reference temperature instead of directly considering the temperature variation of leakage power. There are many caveats to this

<sup>4</sup>Note that the floorplan is an input to our tool, and our tool can consider different floorplans. Again, our study does not focus on a specific architecture or processor design.



Fig. 3. Floorplan used in our experiments. (a) Floorplanning without L2 Cache. (b) Full-chip floorplanning.



TABLE VIII $f_{max}$ for Different $V_{dd}$ and Maximum Temperature $T$



Fig. 4. Temperature curve of the BTB for different time steps $t_s$. The time constant is 2 ms. The clock frequency is 5 GHz and $V_{dd}$ is 0.9 V. 0.5%, 5%, and 25% of the thermal time constant correspond to 50 thousand, 500 thousand, and 2.5 million cycles, respectively. The benchmark is *gcc*.

TABLE IX NORMALIZED RUN TIME FOR VARYING PERIODS OF TEMPERATURE UPDATE. N.T. MEANS THAT TEMPERATURE AND POWER ARE NOT UPDATED DURING THE WHOLE SIMULATION

| $t_s$ (cycle) | N.T. | 1     | 100  | 1000 | 10000 | 50000 |
|---------------|------|-------|------|------|-------|-------|
| Running time  | 1.0  | 23.94 | 5.52 | 1.44 | 1.04  | 1.004 |

approach. First, with dynamic throttling such as clock gating,<sup>5</sup> it is difficult to decide the appropriate reference temperature *a priori* without cycle-accurate simulation with a temperature

<sup>5</sup>The definition of clock gating will be discussed in Section III-F.



Fig. 5. Total power consumption with the breakdown of dynamic and leakage portions. The clock frequency is 6.03 GHz and  $V_{dd}$  is 1.3 V. Clock gating is applied and removes 75% of dynamic power every idle cycle.

dependent leakage model since power and temperature are interrelated. Second, because different benchmarks will exhibit different thermal behavior, and unequal ratios between static and dynamic power, reference temperatures with this simple model are benchmark-dependent. Even with this careful consideration, since leakage power is strongly dependent on temperature, minor temperature variations can lead to large estimation errors in power and thermal simulation with potentially hazardous consequences (see Sections III-E, IV-A1, and IV-A2). Therefore, coupled power and thermal management is necessary. We have shown through this work that coupled power and thermal simulation is indeed highly practical for existing simulation tools.

## E. Thermal Runaway

The thermal runaway problem in MOSFETs due to the positive feedback loop between on-resistance, temperature, and power is well known [35]. In this section, we present another thermal runaway problem, caused by the interaction between leakage power and temperature. As a component's temperature increases, its leakage power increases exponentially. The increase in power consumption can further increase the temperature until the component is in thermal equilibrium with the package's heat removal ability. But if the heat removal ability is not adequate, and the temperature and leakage power interact in a positive feedback loop, both can increase without bound, leading to thermal runaway and catastrophic thermal failure. Assuming no throttling,<sup>6</sup> for transient temperatures $T_0$ and $T_1$ at consecutive times $t_0$ and $t_1$ and corresponding powers $P(T_0)$ and $P(T_1)$, we define the following two criteria as sufficient and necessary conditions<sup>7</sup> for thermal runaway:

- 1)  $T_1 > T_0$  (i.e., the temperature should be increasing).
- 2) The increment of power is larger than the increment of the package's heat removal ability. The package's heat removal ability is defined as $P_o(T) = (T-T_a)/R_t$, where $T_a$ and $R_t$ are the ambient temperature and the thermal resistance, respectively.
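The two criteria can be written down directly as a check on two consecutive samples. The following is a minimal sketch; the function names and the numeric values in the usage lines (an $R_t$ of 0.8 °C/W and a 45 °C ambient) are illustrative assumptions rather than values taken from the runaway experiments.

```python
def heat_removal(temp, t_amb, r_t):
    """Package heat removal ability P_o(T) = (T - T_a) / R_t, in watts."""
    return (temp - t_amb) / r_t


def meets_runaway_criteria(t0, t1, p0, p1, t_amb, r_t):
    """Check criteria 1 and 2 for temperatures t0, t1 and powers p0 = P(t0), p1 = P(t1)."""
    temperature_rising = t1 > t0  # criterion 1
    removal_increment = heat_removal(t1, t_amb, r_t) - heat_removal(t0, t_amb, r_t)
    power_outpaces_cooling = (p1 - p0) > removal_increment  # criterion 2
    return temperature_rising and power_outpaces_cooling


# A 1-degree rise lets the package remove 1 / 0.8 = 1.25 W more heat.
print(meets_runaway_criteria(100.0, 101.0, 90.0, 92.0, 45.0, 0.8))  # power grows by 2 W
print(meets_runaway_criteria(100.0, 101.0, 90.0, 91.0, 45.0, 0.8))  # power grows by 1 W
```

The first call satisfies both criteria because the 2-W power increase exceeds the 1.25-W increase in heat removal; the second does not.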

In addition to temperatures, the second criterion requires knowledge of runtime power and  $T_a$ . We can simplify the second criterion with Theorem 1.

Theorem 1: Criterion (2) is equivalent to  $(d^2T/dt^2) > 0$ , where T is temperature and t is time.

*Proof:* Suppose three different temperatures  $T_1, T_2$  and  $T_3$  are measured at consecutive times  $t_0, t_1$  and  $t_2$ , where  $t_1 - t_0 = t_2 - t_1 = \Delta t$  and  $\Delta t$  is a small time period, then  $(d^2T/dt^2) > 0$  is equivalent to

$$\frac{\frac{T_3 - T_2}{\Delta t} - \frac{T_2 - T_1}{\Delta t}}{\Delta t} > 0.$$
(22)

Suppose that power $P$ is eventually converted to a temperature increment $\delta T$, and that the relationship is given by a function $F$ with $\delta T = F(P)$. The function $F$ is monotonically increasing (i.e., for all $P_1 < P_2$, we have $F(P_1) < F(P_2)$), since the larger the power, the greater the temperature increment it creates.

The temperature changes from $T_1$ to $T_2$ because of the difference between the power $P_1$ and the heat removed, $(T_1 - T_a)/R_t$. Therefore, we have

$$T_2 - T_1 = F\left(P_1 - \frac{(T_1 - T_a)}{R_t}\right).$$
 (23)

Similarly, we can derive

$$T_3 - T_2 = F\left(P_2 - \frac{(T_2 - T_a)}{R_t}\right).$$
 (24)

Equation (22) is equivalent to  $T_3 - T_2 > T_2 - T_1$ . According to the monotonic property of function *F*, this condition can be presented as (25) and then be expressed as (26):

$$P_2 - \frac{T_2 - T_a}{R_t} > P_1 - \frac{T_1 - T_a}{R_t}$$
(25)

$$P_2 - P_1 > \frac{T_2 - T_a}{R_t} - \frac{T_1 - T_a}{R_t}$$
(26)

where (26) is the exact expression for the second criterion.

On the other hand, by assuming (26) we can prove (22) following a similar derivation.  $\Box$ 

<sup>6</sup>Any mechanism that slows down the processor's execution can be categorized as throttling.



Fig. 6. Runaway temperatures.

Compared to the second criterion, Theorem 1 provides a simpler mechanism with reduced complexity to detect thermal runaway.
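In discrete form, Theorem 1 reduces runaway detection to the sign of the second difference in (22), which needs only three temperature samples. The sketch below assumes equally spaced samples; the function names and the two sample traces are made up for illustration.

```python
def second_difference(t1, t2, t3, dt):
    """Discrete second derivative of temperature, as in (22)."""
    return ((t3 - t2) / dt - (t2 - t1) / dt) / dt


def first_runaway_index(temps, dt):
    """Index of the first sample that is rising with a positive second difference, else None."""
    for i in range(2, len(temps)):
        rising = temps[i] > temps[i - 1]
        if rising and second_difference(temps[i - 2], temps[i - 1], temps[i], dt) > 0:
            return i
    return None


# Temperature rising but levelling off: heating toward equilibrium, no runaway.
settling = [60.0, 70.0, 75.0, 77.0]
# Temperature rising faster and faster: flagged as runaway.
accelerating = [60.0, 61.0, 63.0, 66.0]

print(first_runaway_index(settling, 1.0))
print(first_runaway_index(accelerating, 1.0))
```

Unlike criterion 2, this check needs neither the runtime power nor $T_a$.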

We define the lowest temperature that meets both criteria 1 and 2 as the runaway temperature. Once the transient temperature reaches the runaway temperature, thermal runaway cannot be avoided, and the transient temperature will increase indefinitely unless appropriate thermal management is applied. We calculate the runaway temperature according to criteria 1 and 2 for different $f_{\text{max}}$ with appropriate voltage scaling. We choose a maximum temperature constraint of 110 °C, as it is the maximum temperature supported by current design technology. Fig. 6 shows the runaway temperatures for clock frequencies from 7.0 to 7.25 GHz. As clock frequency increases, the runaway temperature decreases because the difference between power $P(T_1)$ and $P(T_0)$ increases. At a clock frequency of 7.25 GHz, the runaway temperatures for integer units can be lower than the maximum temperature constraint of 110 °C. Therefore, thermal runaway may become a severe problem in the near future as clock frequencies continue to increase. Special thermal management schemes are required to combat this problem.

## F. Clock Gating

Due to its exponential dependence on temperature, leakage energy can be greatly affected by mechanisms that significantly reduce system power and temperature. Clock gating [36] reduces dynamic power by turning off the clock signal for idle components. It is shown in [17] that clock gating can indirectly affect leakage energy consumption by changing the temperatures of system components. In the rest of our experiments, we apply clock gating to all components and assume that it reduces dynamic power by 75% in idle cycles.

## IV. COUPLED POWER AND THERMAL MANAGEMENT

In this section, we study coupled power and thermal management using fetch toggling with the proportional-integral (PI) feedback controller presented in [19]. In fetch toggling, when the temperature is higher than a given threshold, the instruction fetch rate is decreased to reduce activity of processor components. A PI controller has two preset parameters: the *gain* and the temperature threshold to trigger thermal management (*setpoint*). The input of the PI controller is the highest on-chip

<sup>7</sup>They are only necessary conditions when there is throttling.

temperature and the output of the PI controller is used to adjust instruction fetch rate by throttling L1 instruction cache, branch predictor and decode units with clock gating. Additionally, fetch toggling can reduce the number of instructions in the out-of-order window, thereby affecting activity of other units as well. We name the coupled power and thermal management with PI feedback controller as *Dynamic Power/Thermal Management* (DPTM).
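The control loop described above can be sketched as a small PI controller that maps the highest on-chip temperature to a fetch duty cycle. This is a toy illustration, not the controller of [19]: the class name, the separate `integral_gain` parameter, the clamping of the integral term, and the interpretation of the output as the fraction of cycles in which fetch is enabled are all assumptions made for this example.

```python
class PIController:
    """Toy PI controller: hottest on-chip temperature in, fetch duty cycle out."""

    def __init__(self, setpoint_c, gain, integral_gain=0.0):
        self.setpoint_c = setpoint_c        # temperature that triggers throttling
        self.gain = gain                    # proportional gain
        self.integral_gain = integral_gain  # hypothetical weight of the integral term
        self.integral = 0.0

    def update(self, max_temp_c, dt_s):
        """Return the fraction of cycles (0.0 to 1.0) in which fetch stays enabled."""
        error = max_temp_c - self.setpoint_c
        # Accumulate the error, clamped at zero so cool periods do not build up credit.
        self.integral = max(0.0, self.integral + error * dt_s)
        throttle = self.gain * max(0.0, error) + self.integral_gain * self.integral
        return min(1.0, max(0.0, 1.0 - throttle))


controller = PIController(setpoint_c=105.0, gain=1.0)
for temp in (100.0, 105.4, 107.0):
    print(temp, controller.update(temp, dt_s=1e-5))
```

Below the setpoint the duty cycle stays at 1.0 (no throttling); above it, fetch is enabled for a shrinking fraction of cycles until it is fully gated.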

## A. Importance of Temperature Dependent Leakage Power Model

Although leakage power has exponential dependence on temperature, studies in the literature tend to choose a fixed leakage power model corresponding to a representative temperature point for low implementation and simulation overhead. In this section, we show that in DPTM, ignoring the temperature dependence of leakage power may lead to either control failure or excessive performance penalty.

We implement both our new temperature dependent leakage power model (the *accurate* model) and the fixed leakage power model (the *simple* model) in DPTM. We choose a maximum temperature constraint of 110 °C, a $V_{dd}$ of 1.55 V, and an $f_{max}$ of 6.5 GHz. Since the component temperatures in the experiments in this section are usually between 65 °C and 110 °C, we choose 65 °C and 110 °C as the reference temperatures for leakage power calculation in the simple model. Because the leakage power values at 65 °C and 110 °C are the lower and upper bounds of the leakage power in our accurate model, we refer to the corresponding simple models as the *underestimated* model and the *overestimated* model, respectively.

In this section, we design the PI controller using the following algorithm: first, we select a few candidate setpoints and gains; then, we perform simulations for all combinations of these candidates; finally, we select as the PI controller the combination of setpoint and gain that achieves the highest IPC (instructions per cycle) with no thermal constraint violations.
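The design algorithm amounts to an exhaustive search over the candidate grid. The sketch below assumes a callable `simulate(setpoint, gain)` that returns the IPC and the peak temperature of a run; that interface and the `toy_simulate` stand-in, whose formulas are invented purely so the example runs, are hypothetical and carry no relation to PTscalar results.

```python
from itertools import product


def design_pi_controller(setpoints, gains, simulate, t_max_c):
    """Return (ipc, setpoint, gain) with the highest IPC and no thermal violation, or None."""
    best = None
    for setpoint, gain in product(setpoints, gains):
        ipc, peak_temp_c = simulate(setpoint, gain)
        if peak_temp_c > t_max_c:
            continue  # this combination violates the thermal constraint
        if best is None or ipc > best[0]:
            best = (ipc, setpoint, gain)
    return best


def toy_simulate(setpoint, gain):
    """Made-up stand-in for a full simulation run (not real data)."""
    ipc = 1.0 + 0.02 * (setpoint - 100.0) - 0.01 * gain
    peak_temp_c = setpoint + 6.0 / gain
    return ipc, peak_temp_c


best = design_pi_controller((100.0, 102.5, 105.0), (1.0, 3.0, 5.0), toy_simulate, 110.0)
print(best)
```

With the toy model, the highest-IPC candidate (setpoint 105, gain 1.0) is rejected because its peak temperature exceeds the constraint, which is the filtering step that distinguishes this search from simply maximizing IPC.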

1) Control Failure by Underestimation of Leakage Power: We choose three candidates for the setpoint (109 °C, 109.4 °C, and 109.8 °C) and three candidates for the gain (0.5, 1.0, and 1.5). With the underestimated model, our algorithm yields a PI controller with a setpoint of 109.8 °C and a gain of 0.5. With this PI controller in DPTM, Fig. 7 plots the transient temperature curves simulated with both the underestimated model and the accurate model. With the underestimated model, it appears that the feedback thermal control effectively limits the maximum on-chip temperature. However, this appearance is erroneous because the leakage power is underestimated. With the accurate leakage model, the PI controller can no longer prevent thermal constraint violations. Clearly, if we design PI controllers according to the underestimated leakage model, they may fail to prevent the maximum on-chip temperature from exceeding the maximum temperature constraint. This example illustrates the importance of accurate leakage modeling in the study of dynamic thermal management.

2) Performance Penalty by Overestimation of Leakage Power: With the overestimated model, we choose three candidates for setpoint: 100 °C, 102.5 °C, and 105 °C, and three candidates for gain: 1.0, 3.0, and 5.0. By choosing smaller



Fig. 7. Transient temperature curves obtained by accurate model and underestimated model. The benchmark is *gcc*.

TABLE X IPC COMPARISON

| Benchmark | PI controller designed by accurate model | PI controller designed by overestimated model | Penalty by overestimation |
|-----------|------------------------------------------|-----------------------------------------------|---------------------------|
| art       | 1.71                                     | 1.64                                          | 4.09%                     |
| bzip2     | 1.16                                     | 1.14                                          | 1.36%                     |
| equake    | 1.27                                     | 1.27                                          | 0%                        |
| gcc       | 1.40                                     | 1.33                                          | 5.24%                     |
| gzip      | 1.83                                     | 1.80                                          | 1.85%                     |
| mesa      | 0.74                                     | 0.74                                          | 0%                        |

setpoints and larger gains, the PI controller can enforce throttling while the temperature is still low and becomes more sensitive to temperature increases, both of which help to eliminate temperature constraint violations. According to our algorithm, we obtain a PI controller with a setpoint of 102.5 °C and a gain of 1.0 for the overestimated model. However, if we design the PI controller with the accurate leakage model, we obtain a different PI controller with a setpoint of 105 °C and a gain of 1.0. Table X shows the IPC results obtained under the accurate model with the PI controllers designed using the accurate model and the overestimated model. From Table X we can see that the overestimated model leads to lower IPC because of the excessive performance penalty of unnecessary throttling. The IPC obtained by a controller based on the overestimated model is up to 5.24% lower than that based on the accurate model. This result further indicates the necessity of coupled power and thermal modeling for thermal management.

## V. OPTIMAL VOLTAGE SCALING WITH DYNAMIC POWER AND THERMAL MANAGEMENT

In this section, we study the following problem: given different packaging and cooling techniques, we consider voltage scaling with dynamic power and thermal management (DPTM) such that system performance is maximized. System performance is defined as throughput in billion instructions per second (BIPS) in (27):

$$\text{Throughput} = \frac{IPC \times \text{clock\_frequency}}{10^9}$$
 (27)

where clock\_frequency is the processor clock frequency.

TABLE XI PERFORMANCE COMPARISON. RESULTS ARE THE AVERAGE OVER SIX SPEC 2000 BENCHMARKS: art, bzip2, equake, gcc, gzip and mesa

|                    | Design for worst-case benchmarks without DPTM | Design for common-case benchmarks with DPTM to avoid the worst-case |
|--------------------|-----------------------------------------------|----------------------------------------------------------------------|
| Performance (BIPS) | 8.5                                           | 9.06 (+6.59%)                                                        |

## A. System Performance With Air Cooling

In this subsection, we assume air cooling with a heatsink thermal resistance of 0.8 °C/W. As in Section IV-A, we choose the PI controller and fetch toggling mechanism for DPTM. We examine a number of values for $V_{dd}$ and maximum temperature constraints to find the best performance. Because it is not realistic to design a specific PI controller for each combination of $V_{dd}$ and maximum temperature constraint using the algorithm in Section IV-A, we set the setpoint 5 °C below the maximum temperature constraint and fix the gain at 1.0.

We first study the performance impact of DPTM. The maximum temperature constraint is no more than 110 °C, and $V_{dd}$ is between 0.9 and 1.4 V. Without DPTM, the clock frequencies that guarantee temperatures below 110 °C for all benchmarks are between 5.0 and 6.41 GHz. With DPTM, the added flexibility enlarges the solution space, and the clock frequency can be chosen between 5.0 and 6.86 GHz. Table XI compares the maximum throughput of designs targeting the worst-case thermal scenario among the benchmark set without DPTM against that of designs targeting the common-case thermal scenario with DPTM. By allowing higher BIPS for common-case benchmarks and reducing BIPS for worst-case benchmarks to avoid temperature violations, DPTM improves the maximum throughput measured over the benchmark set by 6.59%.
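The throughput definition in (27) and the relative gain reported in Table XI can be checked with a few lines. The function names are hypothetical, and the first usage line assumes that the gcc IPC in Table X was obtained at the 6.5-GHz clock of Section IV-A.

```python
def throughput_bips(ipc, clock_hz):
    """Throughput in billion instructions per second, following (27)."""
    return ipc * clock_hz / 1e9


def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# gcc with the accurate-model controller (Table X), assuming a 6.5-GHz clock.
gcc_bips = throughput_bips(1.40, 6.5e9)
# Average throughput without and with DPTM (Table XI).
dptm_gain = relative_gain_percent(8.5, 9.06)

print(f"gcc throughput: {gcc_bips:.2f} BIPS")
print(f"DPTM gain: {dptm_gain:.2f}%")
```

The second value reproduces the 6.59% improvement listed in Table XI from its two BIPS entries.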

Fig. 8 further presents the performance impact of DPTM under $V_{dd}$ and temperature scaling. Without considering the impact of thermal management on performance, it has been assumed in the literature that a higher $V_{dd}$ always leads to a faster system clock frequency and, therefore, higher throughput. However, a higher $V_{dd}$ leads to larger power consumption and higher temperature, which results in more throttling and larger IPC loss under DPTM. Therefore, a higher $V_{dd}$ does not always guarantee better throughput. Fig. 8 shows that increasing $V_{dd}$ from 1.2 V to 1.4 V can actually reduce throughput by up to 57% (for cases with a maximum temperature constraint of 80 °C). Clearly, the optimal $V_{dd}$ for the best throughput may not be the largest $V_{dd}$ in the presence of DPTM. Voltage scheduling schemes may have to consider the thermal impact on performance in order to decide the optimal $V_{dd}$ for maximum throughput.

## B. Impact of Advanced Cooling Techniques

Better cooling techniques can help to reduce system thermal resistance, dissipate heat more quickly, and enable faster clock frequencies. Novel cooling techniques include cooling studs, microbellows cooling, microchannel cooling [37] and direct water spray-cooling on electronic devices [38]. In this subsection, we consider two representative heatsink thermal



Fig. 8. Average throughput with DPTM under different  $V_{dd}$  and maximum temperature constraints for six SPEC 2000 benchmarks: *art*, *bzip2*, *equake*, *gcc*, *gzip* and *mesa*.



Fig. 9. Average throughput and power efficiency under different  $V_{dd}$ , maximum temperature constraints and different cooling conditions for six SPEC 2000 benchmarks: *art*, *bzip2*, *equake*, *gcc*, *gzip* and *mesa*.

resistances: 1)  $R_t = 0.8$  °C/W for conventional air cooling; and 2)  $R_t = 0.067$  °C/W for water spray-cooling in [38], which we call *active cooling*, and study the impact of active cooling.

With active cooling, the maximum on-chip temperature is greatly reduced. As a consequence, we can: 1) reduce the maximum temperature constraint; and 2) increase $V_{dd}$, both of which enable a faster clock frequency and a larger solution space for better throughput. Fig. 9 compares the performance and power efficiency (power/throughput) between cases with and without active cooling. It shows that active cooling not only increases maximum throughput by 15.1%, but also slows down the decay of power efficiency as $V_{dd}$ increases and improves maximum power efficiency by 11.45%. Traditionally, research on active cooling techniques has been limited to mainframe computers and power electronics. Our results in Fig. 9 clearly indicate that these techniques can also be effective, and may become necessary, for microprocessors.
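A rough feel for why the lower thermal resistance matters comes from the heat removal ability $P_o(T) = (T - T_a)/R_t$ defined in Section III-E. The sketch below treats the heatsink resistance as the only thermal resistance and assumes a 45 °C ambient temperature; both are simplifying assumptions for illustration and are not parameters taken from the experiments.

```python
def heat_removal_w(temp_c, t_amb_c, r_t):
    """Heat the package can remove at a given temperature, P_o(T) = (T - T_a) / R_t."""
    return (temp_c - t_amb_c) / r_t


T_MAX_C = 110.0  # maximum temperature constraint used in the text
T_AMB_C = 45.0   # hypothetical ambient temperature

air_w = heat_removal_w(T_MAX_C, T_AMB_C, 0.8)      # conventional air cooling
spray_w = heat_removal_w(T_MAX_C, T_AMB_C, 0.067)  # water spray-cooling

print(f"air cooling:   {air_w:.1f} W removable at {T_MAX_C:.0f} C")
print(f"spray cooling: {spray_w:.1f} W removable at {T_MAX_C:.0f} C")
print(f"ratio: {spray_w / air_w:.2f}x")
```

The ratio depends only on the two resistances (0.8/0.067, about 12), so it holds for any ambient temperature under this simplified model.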

## VI. CONCLUSION AND DISCUSSION

In the context of cycle-accurate simulation, we have presented performance and leakage power models with supply voltage and temperature scaling, and developed a microarchitecture-level coupled thermal and power simulator, PTscalar. With this simulator, we have shown that leakage energy can differ by up to 38% between temperatures, with a corresponding difference in total energy of up to 24%. Hence, microarchitecture-level power simulation is hardly accurate without a temperature dependent leakage model. We have studied the system-level thermal runaway problem induced by the interdependence of leakage and temperature, and shown that it may be a severe problem in the near future. We have further demonstrated that, for dynamic thermal management, underestimating the temperature dependence of leakage leads to violations of temperature constraints, and overestimating it leads to up to 5.24% performance loss. Finally, we have studied optimal voltage scaling for best performance with dynamic power and thermal management under different packaging options. We have shown that dynamic power and thermal management allows designs to target the common-case thermal scenario among the benchmark set and uses dynamic throttling to avoid the worst-case thermal scenario. This achieves a 6.59% performance improvement compared to designs that target only the worst case. Additionally, we have shown that the optimal $V_{dd}$ for the best performance may not be the largest $V_{dd}$ allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly.

With the 65-nm technology assumed in this paper, self-heating [39] may become an important issue. However, self-heating is mainly a problem for silicon-on-insulator (SOI) technology. As SOI is not the mainstream technology, we do not consider self-heating in this paper. Furthermore, our microarchitecture model ignores the thermal and voltage impact of control logic. Studies that consider the control logic will be included in our future research.

## REFERENCES

- [1] N. H. E. Weste and K. Eshraghian, *Principles of CMOS VLSI Design*. Reading, MA: Addison-Wesley, 1993.
- [2] S. Borkar, "Design challenges of technology scaling," *IEEE Micro*, vol. 19, no. 4, pp. 23–29, Jul.–Aug. 1999.
- [3] Y. Taur and T. H. Ning, *Fundamentals of Modern VLSI Devices*. Cambridge, U.K.: Cambridge Univ. Press, 1998.
- [4] A. Chandrakasan, W. J. Bowhill, and F. Fox, *Design of High-Performance Microprocessor Circuits*. New York: IEEE Press, 2001.
- [5] A. S. Grove, "Changing vectors of Moore's law—Keynote Speech," presented at the Int. Electron Devices Meeting, San Francisco, CA, Dec. 2002.
- [6] D. Burger and T. Austin, "The Simplescalar tool set version 2.0," Univ. Wisconsin-Madison, 1997.
- [7] IMPACT architectural framework [Online]. Available: http://www.crhc.uiuc.edu/Impact
- [8] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, "The design and use of simplepower: a cycle-accurate energy estimation tool," presented at the Design Automation Conf. (DAC), Anaheim, CA, Jun. 2000.
- [9] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis optimization," presented at the 27th Annu. Int. Symp. Computer Architecture (ISCA), Vancouver, BC, Canada, Jun. 2000.

- [10] W. Liao, J. M. Basile, and L. He, "Leakage power modeling and reduction with data retention," presented at the Int. Conf. Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2002.
- [11] Z. Cheng, M. Johnson, L. Wei, and K. Roy, "Estimation of standby leakage power in cmos circuits considering accurate modeling of transistor stacks," presented at the Int. Symp. Low Power Electronics and Design (ISLPED), Monterey, CA, Aug. 1998.
- [12] J. Butts and G. Sohi, "A static power model for architects," presented at the MICRO-33, Monterey, CA, Dec. 2000.
- [13] W. Jiang, V. Tiwari, E. de la Iglesia, and A. Sinha, "Topological analysis for leakage prediction on digital circuits," presented at the Joint 7th Asia and South Pacific Design Automation Conf. and 15th Int. Conf. VLSI Design, Bangalore, India, Jan. 2002.
- [14] S. Narendra, V. De, S. Borkar, D. Antoniadis, and A. Chandrakasan, "Full-chip subthreshold leakage power prediction model for sub-0.18 μm cmos," presented at the Int. Symp. Low Power Electronics and Design (ISLPED), Monterey, CA, Aug. 2002.
- [15] D. Brooks and M. Martonosi, "Dynamic thermal management for high-performance microprocessors," presented at the 7th ACM/IEEE Int. Symp. High-Performance Computer Architecture (HPCA), Nuevo Leone, Mexico, Jan. 2001.
- [16] A. Dhodapkar, C. Lim, and G. Cai, "Tem<sup>2</sup>p<sup>2</sup>est: A thermal enabled multimodel power/performance estimator," presented at the Workshop on Power Aware Computer Systems (PACS2000), Cambridge, MA, Nov. 2000.
- [17] W. Liao, F. Li, and L. He, "Microarchitecture level power and thermal simulation considering temperature dependent leakage model," presented at the Int. Symp. Low Power Electronics and Design (ISLPED), Seoul, Korea, Aug. 2003.
- [18] K. Skadron, T. Abdelzaher, and M. Stan, "Control-theoretic techniques and thermal-rc modeling for accurate and localized dynamic thermal management," presented at the 8th Int. Symp. High-Performance Computer Architecture (HPCA), Boston, MA, Feb. 2002.
- [19] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware microarchitecture," presented at the 30th Int. Symp. Computer Architecture, San Diego, CA, Jun. 2003.
- [20] S. Heo, K. Barr, and K. Asanovic, "Reducing power density through activity migration," presented at the Int. Symp. Low Power Electronics and Design (ISLPED), Seoul, Korea, Aug. 2003.
- [21] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "Hotleakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects," Dept. Comput. Sci., Univ. Virginia, Charlottesville, VA, Tech. Rep. CS-2003-05, Mar. 2003.
- [22] PTscalar (2004). [Online]. Available: http://eda.ee.ucla.edu/PTscalar/
- [23] S. Mutoh *et al.*, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," *IEEE J. Solid-State Circuits*, vol. 30, no. 8, pp. 847–854, Aug. 1995.
- [24] D. Lee, W. Kwong, D. Blaauw, and D. Sylvester, "Analysis and minimization techniques for total leakage considering gate oxide leakage," presented at the Design Automation Conf. (DAC), New Orleans, LA, Jun. 2003.
- [25] F. Li, L. He, J. Basile, R. J. Patel, and H. Ramamurthy, "High level area and current estimation," presented at the Asia and South Pacific Design Automation Conf., Yokohama, Japan, Jan. 2004.
- [26] S. Yang, "Logic synthesis and optimization benchmarks user guide-ver. 3.0," MCNC, Research Triangle Park, NC, Jan. 1991.
- [27] UC Berkeley Device Group, "BSIM 4 MOSFET Model," Univ. California, Berkeley, CA, Jul. 2002.
- [28] UC Berkeley Device Group, "Berkeley Predictive Technology Model (BPTM)," Univ. California, Berkeley, CA, Jul. 2002.
- [29] R. Cobbold, "Temperature effects on MOS transistors," *Electron. Lett.*, vol. 2, pp. 190–192, 1966.
- [30] J. Van De Vegte, *Feedback Control Systems*, 3rd ed. New York: Prentice-Hall, 1994.
- [31] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," WRL, Research Rep. 2001/2, 2001.
- [32] B. A. Gieseke *et al.*, "A 600 MHz superscalar RISC microprocessor with out-of-order execution," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 1997, pp. 176–177.
- [33] N. S. Palacharla and J. E. Smith, "Quantifying the complexity of superscalar processors," Univ. Wisconsin-Madison, Technology Rep. CSTR-96-1328, Nov. 1996.
- [34] The International Technology Roadmap for Semiconductors (2001). [Online]. Available: http://public.itrs.net/
- [35] R. Severns, "Safe operating area and thermal design for MOSPOWER transistors," Siliconix, Santa Clara, CA, Siliconix Application Note AN83-10, Nov. 1983.

- [36] V. Tiwari, D. Singh, S. Rajgopal, and G. Mehta, "Reducing power in high-performance microprocessors," presented at the Design Automation Conf. (DAC), San Francisco, CA, Jun. 1998.
- [37] H. B. Bakoglu, *Circuits, Interconnections, and Packaging for VLSI*. Reading, MA: Addison-Wesley, 1990.
- [38] M. Shaw, J. Waldrop, S. Chandrasekaran, B. Kagalwala, X. Jing, E. Brown, V. Dhir, and M. Fabbeo, "Enhanced thermal management by direct water spray of high-voltage, high power devices in a three-phase, 18-hp ac motor drive demonstration," presented at the 8th Intersociety Conf. Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm'02), San Diego, CA, Jun. 2002.
- [39] B. M. Tenbroek, M. S. L. Lee, W. Redman-White, R. J. T. Bunyan, and M. J. Uren, "Impact of self-heating and thermal coupling on analog circuits in SOI CMOS," *IEEE J. Solid-State Circuits*, vol. 33, no. 7, pp. 1037–1046, Jul. 1998.



**Lei He** (S'94–M'99) received the B.S. degree in electrical engineering from Fudan University, China, in 1990, and the Ph.D. degree in computer science from the University of California at Los Angeles (UCLA) in 1999.

He is currently an Assistant Professor in the Electrical Engineering Department at UCLA. From 1999 to 2001, he was a faculty member at the University of Wisconsin, Madison. He has held industrial positions with Cadence, Hewlett-Packard, Intel, and Synopsys. His research interests include computer-aided design of VLSI circuits and systems, interconnect modeling and design, programmable logic and interconnect, and power-efficient circuits and systems.

Dr. He received the Dimitris N. Chorafas Foundation Prize for Engineering and Technology in 1997, the Distinguished Ph.D. Award from the UCLA Henry Samueli School of Engineering and Applied Science in 2000, the NSF CAREER Award in 2000, the UCLA Chancellor's Faculty Development Award in 2003, and the IBM Faculty Award in 2003.



**Weiping Liao** (S'05) received the B.S. and M.S. degrees, both in physics, from the University of Science and Technology of China, Hefei, China, in 1996 and 1999, respectively, and the M.S. degree in computer engineering from the University of Wisconsin, Madison, in 2002. He is currently pursuing the Ph.D. degree in the Electrical Engineering Department at the University of California at Los Angeles.

His research interests include power-efficient microarchitecture, leakage power modeling and reduction, and interconnect modeling and optimization.



**Kevin M. Lepak** (S'99–M'03) received the B.S., M.S., and Ph.D. degrees, all in electrical engineering, from the University of Wisconsin, Madison, in 1999, 2000, and 2003, respectively.

He is an active researcher in the area of computer architecture and also maintains an interest in VLSI design/CAD/EDA. He is currently with Advanced Micro Devices, Austin, TX, focused on multiprocessor system design, memory system design, performance evaluation, commercial workloads, and microarchitecture for future-generation microprocessors and systems. He is also an adjunct faculty member at The University of Texas at Austin.