I am working on two parts: one is dynamic voltage scaling for
network processors; the other is the multi-core design that explores
the trade-off between ILP and TLP, in collaboration with Changbo and Luke.
For the dynamic voltage scaling part, I have been hacking on the
open-source Intel IXP simulator NepSim. This simulator includes a
power model for dynamic power estimation, and a leakage model that
estimates leakage power as a fixed percentage of dynamic power. I
have started modifying the code to implement dynamic voltage
scaling. First I will add the Vdd-frequency relationship to the
simulator; then I will provide a separate voltage supply for each
processing core; after that I will implement different packet
arrival rates according to user-specified input; and finally I
will add the dynamic voltage scaling management, accounting for
the transition overhead of each voltage regulator module.
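To make the last step concrete, below is a minimal sketch of the
kind of per-core management policy I have in mind. This is not
NepSim code: the alpha-power Vdd-frequency relation is a standard
approximation, and all names and constants (VTH, T_TRANSITION, the
occupancy thresholds) are hypothetical placeholders.

    # Minimal sketch of a per-core DVS policy driven by input-queue
    # occupancy; all constants are hypothetical placeholders.
    VDD_LEVELS = [1.0, 1.1, 1.2, 1.3]
    VTH, ALPHA = 0.35, 1.5         # assumed threshold voltage and
                                   # velocity-saturation exponent
    F_NOM, VDD_NOM = 600e6, 1.3    # nominal core frequency and Vdd
    T_TRANSITION = 10e-6           # assumed regulator switch time (s)

    def freq_at(vdd):
        """Alpha-power law: f scales as (Vdd - Vth)^alpha / Vdd."""
        scale = ((vdd - VTH) ** ALPHA / vdd) \
                / ((VDD_NOM - VTH) ** ALPHA / VDD_NOM)
        return F_NOM * scale

    def next_vdd(queue_len, queue_cap, vdd, idle_time):
        """Step the core's Vdd up when its queue fills, and down only
        after an idle period longer than the regulator transition
        overhead, so a switch cannot cost more than it saves."""
        occupancy = queue_len / queue_cap
        i = VDD_LEVELS.index(vdd)
        if occupancy > 0.75 and i + 1 < len(VDD_LEVELS):
            return VDD_LEVELS[i + 1]    # queue filling up: speed up
        if occupancy < 0.25 and i > 0 and idle_time > T_TRANSITION:
            return VDD_LEVELS[i - 1]    # sustained slack: slow down
        return vdd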
Figure 1 shows the performance and power when the network
processor is assigned different voltages and frequencies. This is
static voltage assignment, not dynamic scaling: the same voltage
and frequency are assigned to all cores. It is easy to see that
there is no single operating point that is optimal for both power
and performance. Dynamic voltage scaling can better exploit the
irregular behavior of packet processing.
Figure 1: Performance and power with different core frequencies.
Vdd ranges from 1.0V to 1.3V. SRAM frequency is 200MHz. SDRAM
frequency is 150MHz. IXbus (the bus between the on-chip FIFOs and
the MACs) frequency is 80MHz.
For the multi-core design part, below is the outline:
1. Target architecture: chip multi-processing (CMP). We do not
consider SMT this time, but it definitely needs to be considered
in future development.
2. Benchmark characteristics: our optimization focuses on given
sets of benchmarks. For each benchmark, we characterize two
properties by profiling: (1) instruction-level parallelism: this
property indicates the amount of ILP in the benchmark, and can be
represented by the IPC under ideal-case simulation (a wide-issue
superscalar with ideal branch prediction, an unlimited number of
functional units, and ideal caches). (2) memorism: this property
indicates how dependent the benchmark is on main memory, i.e. its
rate of memory requests, and can be represented by the L2 miss
rate per unit of L2 capacity, for example, miss rate / KB. Here
we can simulate a few L2 cache settings and use curve fitting to
obtain the miss rate / KB (see the fitting sketch after this
outline).
3. Problem formulation: for a given benchmark set with ILP and
memorism characteristics, we try to design a CMP that maximizes
performance under an area constraint (a tentative symbolic
statement is given after this outline). A power constraint can be
considered later.
4. Cores: we can have two kinds of core designs: (A) homogeneous
cores, where we only have to decide the number of cores and the
single core configuration (including clock frequency); and (B)
heterogeneous cores, where the configurations of the cores differ.
The homogeneous design is simpler, but the heterogeneous design
can better match the characteristics of each individual benchmark
for better performance.
5. The performance metric is throughput. Not only the IPC but
also the clock frequency must be taken into account: to take a
hypothetical example, a 4-wide core reaching an IPC of 1.6 at
2GHz delivers 3.2 BIPS, while a simpler 2-wide core reaching an
IPC of 1.2 at 3GHz delivers 3.6 BIPS, despite its lower IPC. For
benchmarks with high ILP, we tend to design large pipeline
structures with wide issue width and multiple functional units.
For benchmarks with high memorism, we tend to design cores with
large caches and also reduce the number of cores.
6. Exploration: we use simulated annealing (SA) to search for a
near-optimal solution. Each SA candidate is evaluated quickly by
leveraging the analytical superscalar model and the bus model (a
skeleton of the SA loop is sketched after this outline). The ILP
and memorism of each benchmark should be fed into the SA. This
will take some thinking and discussion.
7. Comparison: we can make three comparisons. First, we can
compare the performance of our CMP to a wide-issue superscalar
that exploits only ILP. Second, we can compare to a CMP that
simply puts existing cores together. Third, we can compare to the
ideal case.
8. Todo: first, we need to profile the benchmarks; I can work on
that. Second, we need a table of all microarchitecture components.
For example, we may have different caches and different branch
predictors to choose from; for each component configuration, we
need to obtain the area and the access latency in ns. Changbo has
similar data from a previous submission and can build up the table
by making some changes to that data. We can target 65nm
technology. Third, the design of the interconnect pipeline really
depends on the clock frequency, which is decided during our
optimization, so the interconnect pipeline should be determined
on the fly (see the note after this outline). Fourth, we should
formalize the SA procedure. I will work with Changbo and Luke on
this.
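For the curve fitting in item 2, here is a minimal sketch assuming
the commonly used power-law relation miss_rate = a * size^(-b);
the sample points are made-up placeholders for what profiling
would produce, and the fitted b would then serve as the
benchmark's memorism parameter.

    # Fit miss rate vs. L2 size to miss_rate = a * size^(-b)
    # via a least-squares line in log-log space.
    import math

    sizes_kb   = [128, 256, 512, 1024]     # simulated L2 capacities
    miss_rates = [0.20, 0.14, 0.10, 0.07]  # hypothetical profile data

    xs = [math.log(s) for s in sizes_kb]
    ys = [math.log(m) for m in miss_rates]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    b, a = -slope, math.exp(ybar - slope * xbar)

    def predicted_miss_rate(size_kb):
        return a * size_kb ** (-b)   # e.g. ~0.14 at 256 KB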
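For item 3, a tentative symbolic statement (my reading, to be
refined in discussion): given benchmarks b_1..b_n, choose the
number of cores m, configurations c_1..c_m, and an assignment
a: {1..n} -> {1..m} that maximize the total throughput
sum_i IPC(b_i, c_a(i)) * f(c_a(i)), subject to sum_j area(c_j),
plus the interconnect area, staying within the area budget.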
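For item 6, a minimal skeleton of the SA loop to seed the
discussion; the move set, the cooling schedule, and especially
evaluate(), which merely stands in for the analytical superscalar
and bus models, are all placeholders.

    # Simulated-annealing skeleton for the CMP design exploration.
    import math, random

    AREA_BUDGET = 200.0                      # mm^2, hypothetical

    def random_core():
        """One core: (issue width, L2 KB, clock GHz)."""
        return (random.choice([1, 2, 4]),
                random.choice([256, 512, 1024]),
                random.choice([1.5, 2.0, 3.0]))

    def random_config():
        return [random_core() for _ in range(random.randint(1, 8))]

    def area(cfg):
        # stand-in area model: wider issue and bigger caches cost more
        return sum(5.0 * w + 0.02 * l2 for w, l2, _ in cfg)

    def evaluate(cfg, benchmarks):
        """Placeholder for the analytical model: each benchmark runs
        on its best core; ILP caps the useful issue width, and
        memorism penalizes small L2s."""
        if area(cfg) > AREA_BUDGET:
            return 0.0
        return sum(max(min(ilp, w) * f / (1.0 + mem * 256.0 / l2)
                       for w, l2, f in cfg)
                   for ilp, mem in benchmarks)

    def mutate(cfg):
        cfg = list(cfg)
        r = random.random()
        if r < 0.3 and len(cfg) < 8:
            cfg.append(random_core())            # add a core
        elif r < 0.6 and len(cfg) > 1:
            cfg.pop(random.randrange(len(cfg)))  # drop a core
        else:
            cfg[random.randrange(len(cfg))] = random_core()  # retune
        return cfg

    def anneal(benchmarks, steps=5000, t=1.0, cooling=0.999):
        cur = random_config()
        cur_score = evaluate(cur, benchmarks)
        best, best_score = cur, cur_score
        for _ in range(steps):
            cand = mutate(cur)
            s = evaluate(cand, benchmarks)
            if s > cur_score or \
               random.random() < math.exp((s - cur_score) / t):
                cur, cur_score = cand, s
            if cur_score > best_score:
                best, best_score = cur, cur_score
            t *= cooling
        return best, best_score

For example, anneal([(3.2, 0.1), (1.1, 2.0)]) would search for a
CMP serving one high-ILP benchmark and one memory-bound benchmark.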
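On the interconnect point in item 8, my working assumption is that
the pipeline depth simply falls out of the wire delay and the
chosen clock, roughly stages = ceil(wire_delay_ns * f_GHz): a
hypothetical 2ns cross-chip link would need 6 stages at 3GHz but
only 3 at 1.5GHz, which is why it must be decided inside the
optimization loop rather than fixed up front.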