# Coupled Power and Thermal Simulation with Active Cooling

Weiping Liao and Lei He $^\star$ 

Electrical Engineering Department University of California, Los Angeles, CA 90095 {wliao,lhe}@ce.ucla.edu

Abstract. Power is rapidly becoming the primary design constraint for systems ranging from server computers to handhelds. In this paper we study microarchitecture-level coupled power and thermal simulation considering dynamic and leakage power models with temperature and voltage scaling. We develop an accurate temperature-dependent leakage power model and an efficient temperature calculation, and show that leakage energy can differ by up to 10X for temperatures between 35°C and 110°C. Given the growing significance of leakage power and its sensitive dependence on temperature, no power simulation is accurate without dynamic temperature calculation. Furthermore, we discuss the thermal runaway induced by the interdependence between leakage power and temperature, and show that thermal runaway could become a severe problem in the near future. We also study microarchitecture-level coupled power and thermal management with novel active cooling techniques that reduce packaging thermal resistance. We show that an active cooling technique that reduces thermal resistance from $0.8^{\circ}$C/W to $0.05^{\circ}$C/W can increase the maximum system clock by up to 2.44X under the same thermal constraints.

# 1 Introduction

Power is rapidly becoming the primary design constraint for systems ranging from server computers to handhelds, and the related thermal constraints are also emerging as an important issue. Thermal stress caused by high on-chip temperature and large temperature differentials between functional units may lead to malfunction of logic circuits, p-n junction breakdown, clock skew [4], or ultimately physical failure of the microprocessor chip. Therefore, accurate power and thermal modeling is needed to develop and validate power and thermal optimization mechanisms.

<sup>\*</sup> This paper is partially supported by NSF CAREER award CCR-0306682, SRC grant HJ-1008, a UC MICRO grant sponsored by Analog Devices, Fujitsu Laboratories of America, Intel and LSI Logic, and a Faculty Partner Award by IBM. We used computers donated by Intel and SUN Microsystems. Address comments to lhe@ee.ucla.edu.

As semiconductor technology scales to smaller feature sizes, leakage power increases exponentially because transistor threshold voltages are reduced in concert with supply voltage to maintain transistor performance. For current high-performance design methodologies, the contribution of leakage power increases at each technology generation [1], and the Intel Pentium IV processors running at 3GHz in 0.13um technology already have an almost equal amount of leakage and dynamic power [2]. The growing share of leakage power exacerbates thermal problems, since leakage power has an exponential dependence on temperature [1]. Given this, power and thermal modeling can hardly be accurate without considering the interdependence between leakage and temperature.

Almost all existing microarchitecture-level power simulators, such as Wattch [6], SimplePower [5] and PowerImpact [7], do not consider the temperature dependence of leakage power and assume a fixed ratio between dynamic and leakage power. [9] proposes a leakage power model with temperature dependence and applies it to cycle-accurate coupled power and thermal simulation. However, the temperature dependence is characterized by a purely empirical exponential term $exp(\frac{-a}{T-b})$, where a and b are coefficients and T is the temperature, without a theoretical model behind it. Voltage scaling is not considered for either dynamic or leakage power in [9]. On the other hand, the existing microarchitecture-level thermal simulator HotSpot [17] models the thermal package, such as the spreader and heatsink, and considers three-dimensional heat transfer, but it does not consider the temperature dependence of leakage. In short, no existing simulator combines accurate thermal modeling with an accurate treatment of the interdependence of temperature and leakage power.

In this paper, we present power models with clock, voltage, and temperature scaling based on the BSIM2 subthreshold leakage current model. We develop a coupled thermal and power microarchitecture simulator that considers the interdependence between leakage and temperature. With this simulator, we are able to accurately simulate the interdependence between power and temperature and evaluate microarchitectural power and thermal management techniques. We show the dramatic dependence of leakage power on temperature at the microarchitecture level, based on the thermal resistance and chip area of Intel Itanium 2 processors, within the temperature range between $35^{\circ}$C and $110^{\circ}$C. We also theoretically discuss thermal runaway induced by the interdependence between leakage power and temperature. These studies underscore the need for coupled power and thermal simulation. We further study the impact of active cooling techniques. We show that an active cooling technique that reduces thermal resistance from $0.8^{\circ}$C/W to $0.05^{\circ}$C/W can increase the maximum system clock by up to 2.44X under the same thermal constraints.

The rest of this paper is organized as follows. In Section 2, we develop dynamic and leakage power models with both voltage and temperature scaling. In Section 3, we introduce both transient and steady-state temperature calculation and the experiments on thermal-sensitive energy simulation at the microarchitecture level. In Section 4, we present the impact of coupled power and thermal management with novel active cooling techniques. We conclude and discuss future work in Section 5.

# 2 Power Model with Temperature and Voltage Scaling

We define three power states: (i) *active mode*, where a circuit performs an operation and dissipates both dynamic power $(P_d)$ and leakage power $(P_s)$. The sum of $P_d$ and $P_s$ is active power $(P_a)$. (ii) *standby mode*, where a circuit is idle but ready to execute an operation, and dissipates only leakage power $(P_s)$. (iii) *inactive mode*, where a circuit is deactivated by power gating [12] or other leakage reduction techniques, and dissipates a reduced leakage power defined as inactive power $(P_i)$. A circuit in the inactive mode requires a non-negligible amount of time to wake up before it can perform a useful operation [7].

In cycle-accurate simulations, power is defined as the energy per clock cycle. Therefore, $P_d$ is equal to $\frac{1}{2}f_sCV^2$, where C is the switching capacitance, V is the supply voltage and $f_s$ is the switching factor per clock cycle. In essence, $P_d$ is the energy to finish a fixed number of operations during one cycle. Consistently, $P_s$ is defined as $P_{so} * t$, where $P_{so}$ is leakage power per second and t is the clock period. Like $P_s$, $P_i = P_{io} * t$ is proportional to the clock period, with $P_{io}$ being the reduced leakage power in the inactive mode.
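As a minimal illustration of these per-cycle definitions, the following Python sketch evaluates $P_d$ and $P_s$ for a single unit. The function names and all numeric values are assumptions chosen for illustration; they are not taken from the simulator.

```python
def dynamic_energy_per_cycle(f_s, c_farads, vdd):
    """P_d = 1/2 * f_s * C * V^2, in joules per cycle."""
    return 0.5 * f_s * c_farads * vdd ** 2


def leakage_energy_per_cycle(p_so_watts, clock_hz):
    """P_s = P_so * t, where t = 1 / clock is the clock period."""
    return p_so_watts / clock_hz


# Hypothetical unit: 100 pF switched capacitance, 20% switching factor,
# 10 mW leakage power.
e_dyn = dynamic_energy_per_cycle(0.2, 100e-12, 1.0)
e_leak_3ghz = leakage_energy_per_cycle(10e-3, 3e9)
e_leak_1ghz = leakage_energy_per_cycle(10e-3, 1e9)
print(e_dyn, e_leak_3ghz, e_leak_1ghz)
```

Under these definitions, slowing the clock leaves the dynamic energy per cycle unchanged at a fixed voltage but raises the leakage energy per cycle, because each cycle lasts longer.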

#### 2.1 Dynamic Energy with Voltage Scaling

Dynamic energy is consumed by charging and discharging capacitances. It is independent of temperature, but has a quadratic dependence on supply voltage. For VLSI circuits, the relationship between circuit delay and supply voltage $V_{dd}$ is $delay \propto V_{dd}/(V_{dd} - V_T)^2$, where $V_T$ is the threshold voltage. By assuming the maximum clock $f_{max} = 1/delay$, the appropriate supply voltage to achieve $f_{max}$ can be decided by (1):

$$f_{max} \propto (V_{dd} - V_T)^2 / V_{dd} \tag{1}$$

Therefore, the dynamic energy for each cycle varies to achieve different  $f_{max}$ .

#### 2.2 Leakage Estimation with Voltage and Temperature Scaling

**Leakage Model with Temperature and Voltage Scaling** Similar to [9], we use the leakage power model for logic circuits as (2):

$$P_s = N_{gate} * I_{avg} * V_{dd} \tag{2}$$

where $N_{gate}$ is the total number of gates in the circuit and $I_{avg}$ is the average leakage current per gate. We further consider temperature and voltage scaling according to the BSIM2 subthreshold current model [1] as shown in (3):

$$I_{sub} = Ae^{\frac{(V_{GS} - V_T - \gamma V_{SB} + \eta V_{DS})}{n V_{TH}}} \left(1 - e^{-\frac{V_{DS}}{V_{TH}}}\right) \tag{3}$$

$$A = \mu_0 C_{ox} \frac{W}{L_{eff}} V_{TH}^2 e^{1.8} \tag{4}$$

where $V_{GS}$, $V_{DS}$ and $V_{SB}$ are the gate-source, drain-source and source-bulk voltages, respectively; $V_T$ is the zero-bias threshold voltage, $V_{TH}$ is the thermal voltage $\frac{kT}{q}$, $\gamma$ is the linearized body-effect coefficient, $\eta$ is the Drain Induced Barrier Lowering (DIBL) coefficient, n is the subthreshold swing coefficient, $\mu_0$ is the carrier mobility, $C_{ox}$ is the gate capacitance per area, W is the width and $L_{eff}$ is the effective gate length.

From (3) and (4) we can see that the temperature scaling for leakage current is $T^2 e^{-\frac{1}{T}}$, where T is the temperature: the $T^2$ factor comes from $V_{TH}^2$ in (4), and the exponential factor comes from $V_{TH}$ in the exponent of (3). The voltage scaling for leakage current is $e^{-(\alpha V_{dd}+\beta)}$, where $\alpha$ and $\beta$ are parameters to be decided. Based on these observations, we propose the following formula for $I_{avg}$ considering the temperature and voltage scaling:

$$I_{avg}(T, V_{dd}) = I_s(T_0, V_0) * T^2 * e^{\left(-\frac{\alpha * V_{dd} + \beta}{T}\right)} \tag{5}$$

where $I_s$ is a constant value for the reference temperature $T_0$ and voltage $V_0$. The coefficients $\alpha$ and $\beta$ are decided by circuit designs. Values for $\alpha$ and $\beta$ as well as validation of (5) are presented in the model validation later in this section.
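The temperature scaling in (5) can be explored numerically. The sketch below is illustrative only: it uses the "No power gating" logic coefficients listed in Table 1, and it assumes that T in (5) is the absolute temperature in kelvin, which the text does not state explicitly. Because the constant $I_s$ cancels in a ratio, only $\alpha$ and $\beta$ are needed.

```python
import math

# "No power gating" logic coefficients from Table 1 (100nm technology).
ALPHA = -614.9807
BETA = 3528.4329


def kelvin(t_celsius):
    return t_celsius + 273.15


def leakage_scale(t_celsius, vdd):
    """T^2 * exp(-(alpha*Vdd + beta)/T) from (5), with T in kelvin (assumed)."""
    t = kelvin(t_celsius)
    return t ** 2 * math.exp(-(ALPHA * vdd + BETA) / t)


def leakage_ratio(t_hot, t_cold, vdd):
    """Ratio I_avg(t_hot) / I_avg(t_cold); the constant I_s cancels."""
    return leakage_scale(t_hot, vdd) / leakage_scale(t_cold, vdd)


print(round(leakage_ratio(100, 50, 1.3), 2))   # compare with the adder rows of Table 2
print(round(leakage_ratio(110, 35, 1.0), 2))   # span of the 35 C to 110 C range
```

Under these assumptions, the 50°C to 100°C ratio at 1.3V comes out a little above 4, in line with the adder rows of Table 2 (0.0230 vs. 0.00554), and the 35°C to 110°C ratio at 1V is close to 10X.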

We also improve the formula in [9] with better temperature and voltage scaling as shown in (6) - (8):

$$P_{so} = P_{circuits} + P_{cells} \tag{6}$$

$$P_{circuits}(T, V_{dd}) = (X * words + Y * word\_size) * V_{dd} * T^2 * e^{\left(-\frac{\alpha * V_{dd} + \beta}{T}\right)} \tag{7}$$

$$P_{cells}(T, V_{dd}) = (Z * words * word\_size) * V_{dd} * T^2 * e^{\left(-\frac{\gamma * V_{dd} + \delta}{T}\right)} \tag{8}$$

where $P_{cells}$ is the leakage power dissipated by SRAM memory cells and $P_{circuits}$ is the power dissipated by peripheral circuits such as wordline drivers and precharge transistors. $P_{circuits}$ essentially has the same format as (2), since $X * words + Y * word\_size$ in (7) can be viewed as $N_{gate}$, and the scaling in $P_{circuits}$ is the same as (5). $P_{cells}$ is proportional to the number of SRAM memory cells. $X, Y, Z, \gamma$ and $\delta$ in (7) and (8) are coefficients decided by circuit designs. Values for $X, Y, Z, \gamma$ and $\delta$ as well as validation of (7) and (8) are presented in the model validation later in this section.
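A minimal sketch of (6) - (8) for a single SRAM array follows. It rests on several assumptions that the text does not state: T is in kelvin, the formulas yield watts, `words` and `word_size` correspond to the row and column counts of Table 2, the "No power gating" coefficients of Table 1 apply, and the cell exponent uses $\gamma V_{dd} + \delta$ as in the coefficient list above.

```python
import math

# "No power gating" coefficients from Table 1 (100nm technology).
X, Y, ALPHA, BETA = 5.2972e-10, 1.7165e-9, -614.9807, 3528.4329
Z, GAMMA, DELTA = 5.2946e-10, -711.9226, 3725.5342


def sram_leakage_watts(words, word_size, t_celsius, vdd):
    """P_so = P_circuits + P_cells per (6)-(8); T in kelvin is an assumption."""
    t = t_celsius + 273.15
    common = vdd * t ** 2
    p_circuits = ((X * words + Y * word_size) * common
                  * math.exp(-(ALPHA * vdd + BETA) / t))
    p_cells = ((Z * words * word_size) * common
               * math.exp(-(GAMMA * vdd + DELTA) / t))
    return p_circuits + p_cells


for vdd in (1.3, 1.0):
    p_uw = sram_leakage_watts(128, 32, 50, vdd) * 1e6
    print(f"128x32 SRAM at 50 C, Vdd={vdd} V: {p_uw:.1f} uW")
```

With these assumptions the results land within a few percent of the "formula" column of Table 2 for the 128x32 array, which suggests the unit and temperature conventions chosen here are at least plausible.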

**Leakage Model Validation** We collect the power consumption for different types of circuits at a few temperature levels by SPICE simulations. We then obtain the coefficients in (5) - (8) by curve fitting. Table 1 summarizes the coefficients for the ITRS 100nm technology we used. Table 2 compares our high-level leakage power estimation for logic circuits and SRAM arrays with SPICE simulations in ITRS 100nm technology. We use different circuits and temperatures for curve fitting and for verification. The overall difference between our formulas and SPICE simulation is less than 7%.

|                 | Logic circuits |            |            |           | Memory-based units |           |           |
|-----------------|----------------|------------|------------|-----------|--------------------|-----------|-----------|
|                 | X              | Y          | $\alpha$   | $\beta$   | Z                  | $\gamma$  | $\delta$  |
| Power gating    | 3.5931e-12     | 1.2080e-11 | -1986.1263 | 4396.0880 | 8.7286e-11         | -443.2760 | 3886.2712 |
| No power gating | 5.2972e-10     | 1.7165e-9  | -614.9807  | 3528.4329 | 5.2946e-10         | -711.9226 | 3725.5342 |

**Table 1.** Coefficients in (5) - (8) for 100nm technology, where MTCMOS and VRC are the power gating techniques for logic and SRAM arrays, respectively.

| Circuit     | Temperature ($^{\circ}$C) | $V_{dd}$ (V) | Formula | SPICE   | Abs. err. (%) |
|-------------|---------------------------|--------------|---------|---------|---------------|
| adder       | 100                       | 1.3          | 0.0230  | 0.0238  | 3.74          |
|             | 50                        | 1.3          | 0.00554 | 0.00551 | 0.71          |
| multiplier  | 100                       | 1.3          | 0.0209  | 0.0217  | 3.83          |
|             | 50                        | 1.3          | 0.00493 | 0.00506 | 2.63          |
| shifter     | 100                       | 1.3          | 0.0245  | 0.0255  | 3.92          |
|             | 50                        | 1.3          | 0.00592 | 0.00585 | 1.32          |
| SRAM 128x32 | 50                        | 1.3          | 54.1    | 56.8    | 4.81          |
|             | 50                        | 1.0          | 21.62   | 22.31   | 3.07          |
| SRAM 512x32 | 50                        | 1.3          | 211.7   | 227.2   | 6.85          |
|             | 50                        | 1.0          | 84.41   | 88.83   | 4.98          |

**Table 2.** Comparison between our formula and SPICE simulation. The Formula and SPICE columns give $I_{avg}$ for logic circuits and $P_{so}$ for SRAM arrays. The SRAM arrays are represented as "row number" x "column number". The units for $I_{avg}$ and $P_{so}$ are uA and uW, respectively.

# 3 Coupled Power and Thermal Simulation

#### 3.1 Temperature Calculation

We develop the thermal model based on conventional heat transfer theory [13]. The steady-state temperature, reached as time goes to infinity, can be calculated according to (9):

$$T = T_a + R_t * P \tag{9}$$

where T is the temperature, $T_a$ is the ambient temperature, P is the power consumption, and $R_t$ is the thermal resistance, which is inversely proportional to area and indicates the ability to remove heat to the ambient under the steady-state condition. According to (9), the heat loss to ambient can be modeled as $P_o = (T - T_a)/R_t$.

The imbalance between total power consumption P and heat loss to ambient $P_o$ leads to the transient temperature T characterized by (10):

$$P - P_o = C_t \dot{T} \tag{10}$$

where  $C_t$  is the thermal capacitance. By substituting  $P_o$  in (10) with  $(T-T_a)/R_t$  we can get the differential equation (11):

$$R_t C_t \dot{T} + (T - T_a) = R_t P \tag{11}$$

where  $\dot{T} = \Delta T / \Delta t$  and  $\Delta T$  is the temperature change after a short time period  $\Delta t$ . By manipulating (11) we can get (12) for the temperature change  $\Delta T$ :

$$\Delta T = \frac{PR_t - (T - T_a)}{\tau} \Delta t \tag{12}$$

where  $\tau = R_t C_t$  is the thermal time constant. By solving (12) we can obtain an exponential form for temperature T in terms of time t and power, as shown in (13):

$$T = PR_t + T_a - (PR_t + T_a - T_0) \times e^{-\frac{t-t_0}{\tau}} \tag{13}$$

where T and  $T_0$  are temperatures at two different time points t and  $t_0$ . This exponential form clearly shows that the power has a *delayed* impact on the temperature. Note that our cycle-accurate simulation uses (12) directly to avoid the time-consuming exponential calculation.
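The relationship between the incremental update (12) and the closed form (13) can be checked with a short Python sketch. The power, thermal resistance and ambient values below are hypothetical whole-chip numbers, and the power is held constant, which is the condition under which (13) applies.

```python
import math


def euler_step(temp, power, r_t, tau, t_ambient, dt):
    """One update of (12): dT = (P*R_t - (T - T_a)) / tau * dt."""
    return temp + (power * r_t - (temp - t_ambient)) / tau * dt


def closed_form(temp0, power, r_t, tau, t_ambient, elapsed):
    """Temperature after `elapsed` seconds from (13), for constant power."""
    t_final = power * r_t + t_ambient
    return t_final - (t_final - temp0) * math.exp(-elapsed / tau)


# Hypothetical whole-chip values: 50 W, 0.8 C/W, tau = 100 us, 35 C ambient.
P, R_T, TAU, T_A = 50.0, 0.8, 100e-6, 35.0
dt = 0.005 * TAU          # step of 0.5% of the thermal time constant
steps = 200               # 200 steps cover one time constant

temp = T_A
for _ in range(steps):
    temp = euler_step(temp, P, R_T, TAU, T_A, dt)

exact = closed_form(T_A, P, R_T, TAU, T_A, steps * dt)
print(f"Euler: {temp:.3f} C, closed form: {exact:.3f} C")
```

The step-by-step update needs only multiplications and additions per step, which is the motivation given above for using (12) inside the simulation loop.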

As in [9], our thermal model has two modes with different granularities to calculate the temperature: (i) *individual mode*. We assume that there is no horizontal heat transfer between components, and calculate a temperature for each individual component. In general, horizontal heat transfer reduces the temperature gaps between components, so the individual mode essentially gives an upper bound on the highest on-chip temperature and on the temperature gap. (ii) *universal mode*, which is similar to the thermal model in $TEM^2P^2EST$ [10]. We treat the whole processor as a single component with a uniform thermal characteristic and temperature. The universal mode gives a lower bound on the highest on-chip temperature.

#### 3.2 Experiment Parameter Settings

Although our power and thermal models are applicable to any architecture, we study a VLIW architecture in this paper. We integrate our thermal and power model into the PowerImpact [7] toolset. The microarchitecture components in our VLIW processor include the BTB, L1 instruction cache, L1 data cache, unified L2 cache, integer register file, floating-point register file, decoder units, integer units (IALUs) and floating-point units (FPUs). Among them, the BTB, caches and register files are memory-based units, while the others are logic circuits. When calculating the power of memory-based units, we first partition the component into SRAM arrays with the CACTI 3.0 toolset [14], then apply our formulas to obtain the power consumption of each SRAM array. The total component power consumption is the sum of the power of all SRAM arrays. For IALUs and FPUs, we take the area and gate count from the design of the DEC Alpha 21264 processor [15], and scale from 350nm technology down to 100nm technology. For the decode unit, we simply assume that one decode unit has the same area and power consumption as one integer unit.

| Component     | Configuration                                                        |
|---------------|----------------------------------------------------------------------|
| Decode        | 6-issue width                                                        |
| BTB           | 512 entries, 4-way associative, two-level predictor                  |
| Register file | 128 integer and 128 floating-point registers with 64-bit data width  |
| Memory        | Page size 4096 bytes, latency 30 cycles                              |
| Memory bus    | 8 bytes/cycle                                                        |

| Functional units    | Number | Latency                                                           |
|---------------------|--------|-------------------------------------------------------------------|
| Integer unit        | 4      | 1 cycle for add, 2 cycles for multiply and 15 cycles for division |
| Floating-point unit | 2      | 2 cycles for add/multiply, 15 cycles for division                 |

| Cache          | Size  | Block size | Associativity | Policy |
|----------------|-------|------------|---------------|--------|
| L1 Instruction | 64 KB | 32 bytes   | 4             | LRU    |
| L1 Data        | 64 KB | 32 bytes   | 4             | LRU    |
| L2             | 2 MB  | 64 bytes   | 8             | LRU    |

**Table 3.** System configuration for experiments.

To obtain a set of reasonable thermal resistances for components, we take as reference a thermal resistance of 0.8 °C/W for a chip with a die size of 374 $mm^2$, similar to Intel Itanium 2 [16]. Based on this reference, we calculate the thermal resistance of each component as inversely proportional to its area. The whole-chip thermal resistance is calculated in the same manner. Table 3 presents the microarchitecture configuration of the VLIW processors we study. Table 4 summarizes the power consumption, the thermal resistances and the areas for all components in our system. According to the thermal time constant for microarchitecture components without considering the heatsink in [17], we set the thermal time constant to $\tau = 100us$, which is independent of component area.

To consider appropriate supply voltage scaling for varying clock, we assume that $V_T$ is 20% of $V_{dd}$ and that $V_{dd} = 1V$ achieves a 3GHz clock as specified by the ITRS. According to Equation (1), the corresponding $V_{dd}$ for the range of clocks in our experiments is shown in Table 5.

| Component             | $P_a$ | $P_s$  | $P_i$  | $R_t$ (K/W) | Area ($mm^2$) |
|-----------------------|-------|--------|--------|-------------|---------------|
| BTB                   | 119   | 1.23   | 0.0504 | 64.4        | 1.63          |
| L1 Instruction Cache  | 535   | 1.145  | 0.0458 | 22.129      | 4.74          |
| L1 Data Cache         | 460   | 1.145  | 0.0458 | 20.967      | 4.99          |
| Unified L2 Cache      | 1858  | 34.2   | 1.37   | 1.401       | 59.8          |
| Integer Register File | 59.6  | 0.027  | 0.0011 | 24.692      | 4.24          |
| FP Register File      | 35.8  | 0.0275 | 0.0011 | 84.844      | 1.24          |
| One Decode Unit       | 79.2  | 0.68   | 0.0068 | 236.355     | 0.44          |
| One IALU              | 79.2  | 0.68   | 0.0068 | 236.355     | 0.44          |
| One FPU               | 158   | 0.68   | 0.0068 | 125.599     | 0.83          |

**Table 4.** Power consumption (in pJ/cycle), thermal resistance $R_t$ and area for all components. For 100nm technology, we choose 1V supply voltage and 3GHz clock rate as specified by the ITRS. The decode unit, IALU and FPU rows each describe one unit out of a total of six, four and two units, respectively. The temperature is $35^{\circ}$C. Note that $P_s$ is relatively small due to the low temperature.

| Clock (GHz)  | 2     | 3   | 4    | 5     |
|--------------|-------|-----|------|-------|
| $V_{dd}$ (V) | 0.667 | 1.0 | 1.33 | 1.667 |

**Table 5.** $V_{dd}$ after appropriate voltage scaling for different clocks.
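The entries of Table 5 follow directly from (1) once $V_T$ is tied to $V_{dd}$. The sketch below is an illustrative reading of that step; the function names are hypothetical. With $V_T = 0.2 V_{dd}$, the right-hand side of (1) becomes $0.64 V_{dd}$, so the maximum clock is linear in $V_{dd}$ and the reference point (1V, 3GHz) fixes the constant of proportionality.

```python
V_REF = 1.0        # volts, reference supply voltage for 100nm
F_REF = 3.0        # GHz reached at V_REF
VT_FRACTION = 0.2  # V_T assumed to be 20% of V_dd


def relative_fmax(vdd):
    """Right-hand side of (1) with V_T = VT_FRACTION * V_dd."""
    vt = VT_FRACTION * vdd
    return (vdd - vt) ** 2 / vdd


def vdd_for_clock(f_ghz):
    """Supply voltage that reaches f_ghz, normalized to the reference point.

    With V_T proportional to V_dd, (1) is linear in V_dd, so the solution
    is a simple ratio.
    """
    return V_REF * f_ghz / F_REF


for f in (2, 3, 4, 5):
    v = vdd_for_clock(f)
    # Consistency check: the scaled voltage reproduces the clock ratio in (1).
    assert abs(relative_fmax(v) / relative_fmax(V_REF) - f / F_REF) < 1e-12
    print(f"{f} GHz -> Vdd = {v:.3f} V")
```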



**Fig. 1.** Whole chip temperature curves obtained by the universal mode for different time step  $t_s$ . The clock frequency is 2GHz. Three different starting temperatures are chosen: (a) 35°C; (b) 40°C; and (c) 80°C. No throttling is applied. Therefore, the results are independent of benchmarks.

#### 3.3 Chip Temperature

In our experiments, we update temperatures after each time step $t_s$, and then update the power values with respect to the new temperature. A smaller $t_s$ gives a more accurate transient temperature analysis; for example, $t_s = 1$ cycle represents cycle-accurate temperature calculation. Figure 1 plots the transient temperature for the whole chip calculated under different $t_s$, expressed as percentages of the thermal time constant, where 0.5% of the thermal time constant is equal to 1000 clock cycles for a 2GHz clock. When $t_s \leq 0.5\%$ of the thermal time constant, the temperatures are identical to those with $t_s = 1$ cycle. An observable difference appears when $t_s$ is increased to 5% of the thermal time constant, and significant error is induced when $t_s = 25\%$ of the thermal time constant. Clearly, it is not necessary to update temperatures every cycle. Since a step of 0.5% of the thermal time constant leads to negligible error compared with cycle-accurate temperature calculation, we update temperatures and power values once per 0.5% of the thermal time constant in the rest of the paper.
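The effect of the time step can be reproduced qualitatively with a small coupled loop. This is a sketch under stated assumptions, not the simulator itself: the dynamic and reference leakage power values are hypothetical, leakage is scaled from 35°C with the form of (5) using the "No power gating" logic coefficients of Table 1, temperatures are in kelvin, and a very fine step stands in for the cycle-accurate baseline.

```python
import math

ALPHA, BETA = -614.9807, 3528.4329   # Table 1, logic, no power gating
T_REF = 308.15                       # 35 C in kelvin


def leakage_power(p_leak_ref, temp_k, vdd):
    """Leakage scaled from T_REF with the T^2 * exp(-(a*V + b)/T) form of (5)."""
    k = ALPHA * vdd + BETA
    return p_leak_ref * (temp_k / T_REF) ** 2 * math.exp(k / T_REF - k / temp_k)


def simulate(step_fraction, duration_taus=1.0, p_dyn=40.0, p_leak_ref=3.0,
             r_t=0.8, tau=100e-6, t_ambient=308.15, vdd=1.0):
    """Advance temperature with (12), refreshing leakage once per time step."""
    dt = step_fraction * tau
    temp = t_ambient
    for _ in range(round(duration_taus / step_fraction)):
        power = p_dyn + leakage_power(p_leak_ref, temp, vdd)
        temp += (power * r_t - (temp - t_ambient)) / tau * dt
    return temp


reference = simulate(1e-4)           # very fine step used as the baseline
for fraction in (0.005, 0.05, 0.25):
    error = abs(simulate(fraction) - reference)
    print(f"t_s = {fraction:.1%} of tau: error {error:.3f} K")
```

In this toy setting the error after one time constant grows steadily with the step size, which matches the trend described for Figure 1.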

Figure 1 also presents transient temperatures for different starting temperatures. Different starting temperatures lead to virtually the same steady-state temperature, provided that thermal runaway, which is discussed in Section 3.5, does not occur.

#### 3.4 Temperature Dependent Leakage Power and Maximum Clock

Figure 2 shows the experimental results for total leakage energy consumption at a 2.5GHz clock. We assume there is no throttling, i.e., $P_a$ is dissipated in every cycle. We study two cases: one assumes a fixed temperature, and the other considers energy consumption with temperature dependence in both the individual mode and the universal mode. From Figure 2 we can see that changing the temperature from $35^{\circ}$C to $110^{\circ}$C changes the total leakage energy by a factor of 10X. Figure 2 clearly shows that a study of leakage energy is not accurate if temperature is not considered. To account for temperature in the methods of [7,18], designers need to assume a fixed temperature appropriate for the processor and the environment, and then use leakage values at this temperature. Deciding the appropriate temperature is of paramount importance for accurate energy estimation, and it is an open problem in the literature. Our work presents an approach to select the appropriate temperature.

A faster system clock is always desired in high-performance processor designs. However, as the clock increases, the total energy and system temperature both increase as well. The maximum temperature and maximum temperature gap constraints prevent us from increasing the clock rate indefinitely. In the following experiments, we assume the maximum allowable temperature is 110°C, which is the maximum temperature supported by current semiconductor packaging techniques, and the maximum temperature gap among components is 40°C. We use the individual mode to calculate the maximum temperature and the maximum temperature gap, where the maximum temperature is set as the largest temperature among all components <sup>1</sup>. Table 6 shows the maximum system temperature and the maximum temperature gap without any throttling. We can see that the maximum clock with thermal constraints is about 1.5GHz when there is no throttling.

**Fig. 2.** Total leakage energy consumption without any throttling. We study fixed temperatures of $35^{\circ}$C and $110^{\circ}$C, as well as the case with dynamically updated temperature. The cases of "ind" and "uni" stand for the individual mode and universal mode, respectively. The clock is 2.5GHz. Note the results are independent of benchmarks in the no-throttling cases.

| Clock (GHz)         | 0.5    | 1    | 1.5   | 2     | 2.5    |
|---------------------|--------|------|-------|-------|--------|
| Max Temperature     | 36.016 | 41.5 | 56.7  | 87.3  | 157.2  |
| Max Temperature Gap | 0.92   | 3.97 | 19.19 | 46.23 | 110.44 |

**Table 6.** Maximum temperatures (*Max T*) and temperature gaps (*Max Gap*) among components for different clocks without any throttling. The unit for temperatures is  $^{\circ}$ C. The ambient temperature is  $35^{\circ}$ C. Note the results are independent of benchmarks in the no-throttling cases.

#### 3.5 Thermal Runaway

The MOSFET thermal runaway problem due to the positive feedback loop between the on-resistance, temperature and power of the MOSFET is widely known [19]. In this section we present another thermal runaway problem, due to the interaction between leakage power and temperature. As the component temperature increases, its leakage power increases exponentially. The increase of power consumption further increases the temperature until the component is in thermal equilibrium with the package's heat removal ability. If the heat removal is not adequate, thermal runaway occurs as the temperature and leakage power interact in a positive feedback loop and both increase to infinity.

<sup>1</sup> The universal mode gives us a lower bound of the maximum temperature.

For transient temperatures $T_0$ and $T_1$ at consecutive times $t_0$ and $t_1$, with corresponding power $P(T_0)$ and $P(T_1)$, we define the following two criteria as necessary conditions for thermal runaway to occur:

1. $T_1 > T_0$, i.e., the temperature is increasing.
2. The increment of power is larger than the increment of the package's heat removal ability. The package's heat removal ability is defined as $P_o(T) = \frac{T-T_a}{R_t}$, where $T_a$ and $R_t$ are ambient temperature and thermal resistance, respectively.

The second criterion can be mathematically formulated as (14) with relationship between  $T_0$  and  $T_1$  defined by (15):

$$P(T_1) - P(T_0) > \frac{T_1 - T_0}{R_t} \tag{14}$$

$$T_1 - T_0 = \frac{P(T_0)R_t - (T_0 - T_a)}{\tau}(t_1 - t_0) \tag{15}$$

where (15) is derived from (12).

In addition to temperatures, (14) and (15) require knowledge of runtime power,  $R_t$ ,  $\tau$  and  $T_a$ . We can simplify the second criterion with Theorem 1.

**Theorem 1** Criterion 2 is equivalent to $\frac{d^2T}{dt^2} > 0$, where T is temperature and t is time.

The detailed proof of Theorem 1 can be found in [20].  $\Box$ 

Compared to (14) and (15), Theorem 1 provides a simpler mechanism with reduced complexity to detect thermal runaway.
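The intuition behind this equivalence can be sketched in a few lines. The argument below is an illustrative outline, not the proof given in [20], and it assumes that power depends on time only through temperature, $P = P(T)$, with $R_t$, $\tau$ and $T_a$ constant. Differentiating (11) with respect to time gives:

$$\tau \frac{dT}{dt} = R_t P(T) - (T - T_a) \quad\Longrightarrow\quad \tau \frac{d^2T}{dt^2} = \left(R_t \frac{dP}{dT} - 1\right)\frac{dT}{dt}$$

When criterion 1 holds ($\frac{dT}{dt} > 0$), the right-hand side is positive exactly when $\frac{dP}{dT} > \frac{1}{R_t}$, which is the form (14) takes in the limit $T_1 \to T_0$. Under this reading, a runaway check only needs to watch whether the temperature curve is rising and convex.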

We define the lowest temperature that meets criteria 1 and 2 as the runaway temperature. Once the transient temperature reaches the runaway temperature, thermal runaway happens and the transient temperature will increase to infinity if no appropriate thermal management is applied. Figure 3 plots transient temperature curves with thermal runaway.<sup>2</sup> It clearly shows that thermal runaway occurs once the transient temperature reaches the runaway temperature. Two starting temperatures, $35^{\circ}$C and $55^{\circ}$C, are chosen in Figure 3. The occurrence of thermal runaway does not depend on the starting temperature, because the runaway temperature is decided by the power and the package's heat removal ability.

<sup>2</sup> Memory units such as caches present similar curves and therefore are not shown.



**Fig. 3.** Transient temperature curves for one IALU with a 5.5GHz clock. Once the runaway temperature is reached, thermal runaway happens and the transient temperature finally increases to infinity. The thermal runaway temperature is labeled. No throttling is applied.



**Fig. 4.** Runaway temperatures for different clocks and different components.

We calculate the runaway temperature according to criteria 1 and 2 for different clocks. Figure 4 shows the runaway temperatures for clocks from 4.5GHz to 6.5GHz. As the clock increases, the runaway temperature decreases since the difference between power $P(T_1)$ and $P(T_0)$ increases. For clocks faster than 5.5GHz, the runaway temperatures of integer units are below our maximum temperature constraint of $110^{\circ}$C. In other words, we cannot eliminate thermal runaway by simply limiting the operating temperature to the maximum junction temperature supported by current packaging techniques. We anticipate that thermal runaway could be a severe problem in the near future as the clock keeps increasing. Special thermal management schemes will be needed to counter this problem.
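A rough numerical sketch of the second criterion is given below. It makes several assumptions beyond the text: criterion 2 is taken in its differential form $R_t \, dP/dT > 1$, only leakage varies with temperature, leakage follows the scaling of (5) with the "No power gating" logic coefficients of Table 1 and T in kelvin, and the reference leakage values are hypothetical. The function finds only the temperature where the slope condition starts to hold; criterion 1 (a rising temperature) must be checked separately.

```python
import math

ALPHA, BETA = -614.9807, 3528.4329   # Table 1, logic, no power gating
T_REF = 308.15                       # 35 C in kelvin


def leakage_slope(p_leak_ref, temp_k, vdd):
    """dP/dT of leakage that follows the T^2 * exp(-k/T) scaling of (5)."""
    k = ALPHA * vdd + BETA
    scale = (temp_k / T_REF) ** 2 * math.exp(k / T_REF - k / temp_k)
    return p_leak_ref * scale * (2.0 / temp_k + k / temp_k ** 2)


def slope_crossing(p_leak_ref, r_t, vdd, lo=300.0, hi=600.0):
    """Bisect for the temperature where R_t * dP/dT = 1 (criterion 2 boundary)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if r_t * leakage_slope(p_leak_ref, mid, vdd) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# R_t of one IALU from Table 4; the leakage values at 35 C are hypothetical.
R_T = 236.355
for p_leak in (0.010, 0.020):
    t_cross = slope_crossing(p_leak, R_T, vdd=1.0)
    print(f"leakage {p_leak * 1e3:.0f} mW at 35 C -> crossing at {t_cross - 273.15:.1f} C")
```

Doubling the reference leakage lowers the crossing temperature, which is the same direction as the trend in Figure 4, where faster clocks (and thus higher voltage and power) give lower runaway temperatures.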

# 4 Power and Thermal Management with Active Cooling

As the previous discussion shows, the designer's desire to increase the system clock can be severely limited by thermal constraints. Better packaging and active cooling techniques can help to reduce the thermal resistance, dissipate heat more quickly, and enable faster clocks. [4] discusses a few active cooling techniques such as cooling studs, microbellows cooling and microchannel cooling. [21] introduces a novel active cooling technique that uses direct water spray-cooling on electronic devices. In this section, we assume the individual mode and consider three thermal resistance values: (i) $R_t = 0.8^{\circ}$C/W for conventional cooling, (ii) $R_t = 0.05^{\circ}$C/W for the water spray-cooling in [21], and (iii) $R_t = 0.45^{\circ}$C/W, a value in between the above two. We refer to both (ii) and (iii) as active cooling and study its impact.

In our coupled power and thermal management, we turn off the clock signal for idle components by clock gating [22] and assume that clock gating reduces dynamic power by 75%. Our experiments show that with clock gating the maximum clock is 2.25GHz for a thermal resistance of  $0.8^{\circ}$ C/W under the same thermal constraints as in Section 3. Furthermore, similar to [9], we distribute instructions evenly among functional units to eliminate the temperature gaps between integer units.
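The clock-gating assumption can be written down as a short sketch. It assumes, purely for illustration, that an ungated idle component would draw the same per-cycle dynamic energy as an active one. The function name and the cycle counts are hypothetical; only the 75% reduction comes from the text above.

```python
GATING_REDUCTION = 0.75  # fraction of dynamic power removed by clock gating (from the text)


def dynamic_energy(active_cycles, idle_cycles, energy_per_cycle, gated=True):
    """Dynamic energy of one component over a run.

    Assumption: without gating, an idle cycle costs as much as an active cycle;
    with gating, an idle cycle costs (1 - GATING_REDUCTION) of an active cycle.
    """
    idle_scale = (1.0 - GATING_REDUCTION) if gated else 1.0
    return energy_per_cycle * (active_cycles + idle_scale * idle_cycles)


# Hypothetical run: 600 active cycles, 400 idle cycles, 1 energy unit per cycle.
ungated = dynamic_energy(600, 400, 1.0, gated=False)
gated = dynamic_energy(600, 400, 1.0, gated=True)
print(f"ungated: {ungated}, gated: {gated}, saving: {1.0 - gated / ungated:.0%}")
```

The saving depends on how often the component is idle, so the overall benefit of clock gating is workload dependent even though the per-cycle reduction is fixed.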



Fig. 5. Maximum temperature under individual mode with different thermal resistances.

#### 4.1 Maximum Clock

Figure 5 plots the maximum temperatures for different clocks and different  $R_t$ . Active cooling clearly raises the maximum clock while keeping the system temperature well below the thermal constraints. Figure 6 plots the maximum temperature gaps under different cooling techniques and clocks. Combining the results in Figures 5 and 6 with the thermal constraints applied in Section 3.4, we can increase the system clock up to 5.5GHz by scaling  $V_{dd}$  up with  $R_t = 0.05^{\circ}$ C/W. Compared to the 2.25GHz maximum clock with  $R_t = 0.8^{\circ}$ C/W, the active cooling technique with  $R_t = 0.05^{\circ}$ C/W can increase the maximum clock by a factor of 2.44X under the same thermal constraints.

Fig. 6. Maximum temperature gap under individual mode with different thermal resistances.
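The 2.44X factor reported in Section 4.1 is the ratio of the two maximum clocks, where $f_{\max}$ is shorthand introduced here for the maximum clock at a given thermal resistance:

$$\frac{f_{\max}(R_t = 0.05^{\circ}\mathrm{C/W})}{f_{\max}(R_t = 0.8^{\circ}\mathrm{C/W})} = \frac{5.5\,\mathrm{GHz}}{2.25\,\mathrm{GHz}} \approx 2.44$$

For comparison, the thermal resistance itself drops by a factor of $0.8 / 0.05 = 16$, so the gain in maximum clock is far from proportional to the reduction in $R_t$.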



Fig. 7. Total energy consumption under individual mode with different thermal resistances. Note that a few bars for clocks of 3.5GHz and 4GHz are missing due to thermal runaway.

#### 4.2 Total Energy

Figure 7 shows the total energy consumption for the three thermal resistances  $R_t$ . Clearly, the cooling techniques substantially reduce the total energy at the same clock. Compared to  $R_t = 0.45^{\circ}$ C/W,  $R_t = 0.05^{\circ}$ C/W reduces the total energy by up to 18%. Figure 7 also shows that the energy reduction from active cooling grows as the clock increases, which means active cooling techniques are more effective for faster clocks. Note that in Figure 7 a few bars for  $R_t = 0.45$  and  $0.8^{\circ}$ C/W are missing due to thermal runaway. Traditionally, active cooling techniques such as cooling studs and microchannel cooling [4] have been applied only to mainframe computers. Our results clearly indicate that they can also be effective, and may become necessary, for microprocessors.

## 5 Conclusions and Discussions

For cycle-accurate simulation, we have presented dynamic and leakage power models with clock, supply voltage, and temperature scaling, and developed coupled thermal and power simulation at the microarchitecture level. With this simulator, we have shown that the leakage energy can differ by up to 10X for different temperatures. Hence, microarchitecture-level power simulation is hardly accurate without a temperature-dependent leakage model. We have studied the thermal runaway problem induced by the interdependence between leakage power and temperature, and shown that it could be a severe problem in the near future because the runaway temperature can be much lower than the maximum temperature that packages can support. We have also studied microarchitecture-level coupled power and thermal management with novel active cooling techniques. We show that under the same thermal constraints, active cooling techniques such as water spray-cooling, which reduces the thermal resistance from  $0.8^{\circ}$ C/W to  $0.05^{\circ}$ C/W, can increase the maximum clock by a factor of 2.44X.

In this paper, we use a lumped thermal model that does not distinguish packaging components such as the heat spreader and heatsink. We have since developed a coupled power and thermal simulator, PTscalar, which integrates a temperature- and voltage-scalable leakage model with accurate thermal calculation that considers three-dimensional heat transfer and packaging components such as the heat spreader and heatsink. This tool is available at http://eda.ee.ucla.edu/PTscalar. We believe that the conclusions in this paper remain valid under the new PTscalar tool.

## 6 Acknowledgments

The authors would like to thank Dr. Kevin Lepak at Advanced Micro Devices, Inc. for helpful discussions.

# References

- 1. A. Chandrakasan, W. J. Bowhill, and F. Fox, *Design of High-Performance Microprocessor Circuits*. IEEE Press, 2001.
- 2. A. S. Grove, "Changing vectors of Moore's law," in *Keynote speech, International Electron Devices Meeting*, Dec 2002.
- 3. T. Burd, T. Pering, A. Stratakos, and R. Brodersen, "A dynamic voltage-scaled microprocessor system," in *2000 IEEE International Solid-State Circuits Conference Digest of Technical Papers*, Feb 2000.
- 4. H. B. Bakoglu, *Circuits, Interconnections, and Packaging for VLSI*. Addison-Wesley, 1990.
- 5. W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, "The design and use of SimplePower: a cycle-accurate energy estimation tool," in *DAC*, 2000.
- 6. D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," in *ISCA*, 2000.
- 7. W. Liao, J. M. Basile, and L. He, "Leakage power modeling and reduction with data retention," in *ICCAD 02*, Nov 2002.
- 8. J. Butts and G. Sohi, "A static power model for architects," in *Proc. of MICRO33*, December 2000.
- 9. W. Liao, F. Li, and L. He, "Microarchitecture level power and thermal simulation considering temperature dependent leakage model," in *ISLPED*, Aug 2003.
- 10. A. Dhodapkar, C. Lim, and G. Cai, "Tem<sup>2</sup>p<sup>2</sup>est: A thermal enabled multi-model power/performance estimator," in *Workshop on Power Aware Computer Systems*, Nov 2000.
- 11. K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware microarchitecture," in *Proceedings of the 30th International Symposium on Computer Architecture*, 2003.
- 12. S. Mutoh *et al.*, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," *IEEE Journal of Solid-State Circuits*, vol. 30, pp. 847–854, Aug. 1995.
- 13. J. V. D. Vegte, *Feedback Control Systems, 3rd Edition*. Prentice Hall, 1994.
- 14. P. Shivakumar and N. P. Jouppi, "Cacti 3.0: An integrated cache timing, power, and area model," in *WRL Research Report 2001/2*, 2001.
- 15. B. A. Gieseke *et al.*, "A 600MHz superscalar RISC microprocessor with out-of-order execution," in *Proc. IEEE Int. Solid-State Circuits Conf.*, pp. 176–177, 1997.
- 16. J. Stinson and S. Rusu, "A 1.5GHz third generation Itanium 2 processor," in *DAC*, June 2003.
- 17. K. Skadron, T. Abdelzaher, and M. Stan, "Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management," in *Proceedings of the Eighth International Symposium on High-Performance Computer Architecture*, 2002.
- 18. H. Hanson, M. Hrishikesh, V. Agarwal, S. Keckler, and D. Burger, "Static energy reduction techniques for microprocessor caches," in *Proceedings of the International Conference on Computer Design*, 2001.
- 19. R. Severns, "Safe operating area and thermal design for MOSPOWER transistors," in *Siliconix applications note AN83-10*, Nov 1983.
- 20. W. Liao and L. He, *Microarchitecture Level Power and Thermal Simulation Considering Temperature Dependent Leakage Model*. Technical Report 04-246, University of California at Los Angeles, 2003.
- 21. M. Shaw, J. Waldrop, S. Chandrasekaran, B. Kagalwala, X. Jing, E. Brown, V. Dhir, and M. Fabbeo, "Enhanced thermal management by direct water spray of high-voltage, high power devices in a three-phase, 18-hp AC motor drive demonstration," in *Eighth Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems*, 2002.
- 22. V. Tiwari, D. Singh, S. Rajgopal, and G. Mehta, "Reducing power in high-performance microprocessors," in *DAC*, 1998.
