OU: An FPGA-based Overlay Processor for Convolutional Neural Networks

Yunxuan Yu, Chen Wu, Tiandong Zhao, Kun Wang, Senior Member, IEEE, Lei He, Senior Member, IEEE

Abstract—FPGA provides rich parallel computing resources with high energy efficiency, making it ideal for deep convolutional neural network (CNN) acceleration. In recent years, automatic compilers have been developed to generate network-specific FPGA accelerators. However, with more cascading deep CNN algorithms adapted by various complicated tasks, re-configuration of FPGA devices during runtime becomes unavoidable when network-specific accelerators are employed. Such reconfiguration can be difficult for edge devices. Moreover, network-specific accelerator means regeneration of RTL code and physical implementation whenever network is updated. This is not easy for CNN end users. In this paper, we propose a domain-specific FPGA overlay processor, named OPU to accelerate CNN networks. It offers software-like programmability for CNN end users, as CNN algorithms are automatically compiled into executable codes, which are loaded and executed by OPU without re-configuration of FPGA for switch or update of CNN networks. Our OPU instructions have complicated functions with variable runtimes but a uniform length. The granularity of instructions is optimized to provide good performance and sufficient flexibility, while reduce complexity to develop micro-architectures and compiler. Experiments show that OPU can achieve an average of 91% runtime MAC efficiency (RME) among 9 different networks. Moreover, for VGG and YOLO networks, OPU outperforms automatically compiled network-specific accelerators in the literature. In addition, OPU shows 3.35× better power efficiency compared with Titan Xp. For a real time cascaded CNN networks scenario, OPU is 2.9× faster compared with edge computing GPU Jetson Tx2, which has similar amount of computing resources.

Index terms— FPGA Acceleration, CNN overlay Processor, Hardware-software co-design

I. INTRODUCTION

FPGA acceleration for CNNs has drawn much attention in recent years [1]–[11]. Well-designed FPGA accelerator for CNN can leverage full capacity of parallelism in network computation to achieve low latency and high performance. Moreover, FPGA comes with the advantage of reconfigurability that enables fast adoption to new CNN architectures. Additionally, FPGA has higher energy efficiency compared to CPU or GPU.

Implementing a high-performance FPGA accelerator can be time-consuming as it normally involves parallel architecture exploration, memory bandwidth optimization, area and timing tuning, as well as software-hardware interface development. This leads to the development of automatic compilers for FPGA CNN accelerators [8]–[11], where hardware description code of target accelerators can be generated automatically based on parametric templates, and design space exploration is simplified to parameter optimization with regard to network structure and hardware resource constraints.

However, some disadvantages still remain. While RTL code can be generated as the final output of auto-compilers, it still takes logic synthesis, placement and routing to obtain the final bitstream. Furthermore, the resulting design may fail timing requirements. Instead of fixing timing failing paths as in regular FPGA design process, in auto-compiler process, one can only adjust module parameters or relax timing constraints at the expense of performance degradation. Moreover, nowadays complex deep learning tasks usually involve cascaded network flow, it would be inefficient, and sometime impossible for edge computing to constantly re-burn FPGA for different networks during runtime.

In this work, we propose a RTL based hand-coded FPGA overlay domain-specific processor unit (OPU) with software-programmability and fast compilation time, targeting at general CNN accelerations. The processor accepts network architecture from deep learning framework such as Tensorflow [12]. Each time a new network configuration is given, instead of re-generating a new accelerator, we compile the network into instructions to be executed by OPU. OPU has fine-grained pipe-line, and it explores parallelism of different CNN architectures. This ensures an average of 91% runtime utilization of computing resources as shown by experiments on nine different network architectures, including YOLO [13], VGG [14], GoogleNet [15] and ResNet [16]. Moreover, superior power efficiency compared with Titan GPU (both batch = 1 and batch = 64) is observed for all network cases. In addition, for cascaded CNN networks to detect car license plate, OPU is 2.9× faster compared with edge computing GPU Jetson Tx2 with similar amount of computing resources.

More specifying, the proposed overlay processor OPU has following features:

- **CPU/GPU like user friendliness.** As shown in Fig. 1, CNN network is compiled into instructions. This is done once for each network. Then instructions are executed by
 Moreover, the granularity of their instructions is larger than that for our OPU. In [17], one block of about 10 instructions are used for a whole sub-graph defined as a list of chained functions, which is normally a single convolution layer and an optional pooling layer. In contrast, our OPU has the instruction set with smaller granularity (to be discussed in section III) than that in [17]. Each typical operation in CNN inference is mapped to a specific type of instruction, resulting in high runtime efficiency. Moreover, different CNNs can be compiled and then executed without FPGA reconfiguration.

### III. Instruction Set Architecture

Instruction set architecture (ISA) is the key to a processor. Our OPU is specific for CNN inference. We identify all the necessary operations during CNN inference and group them into different categories. Each category maps to one type of instruction with adjustable parameters for flexibility. Our instructions are 32-bit long and have complicated functions and variant runtimes (up to hundreds of cycles). CNN inference can be executed by OPU without a general purpose processor such as CPU.

There are two types of instructions defined: Conditional instruction (C-type) and Unconditional instruction (U-type). C-type instruction specifies target operations and sets operation trigger conditions, while U-type instruction delivers corresponding operation parameters for its paired C-type. As shown in Fig. 2, one instruction unit contains one C-type instruction with \(0 - n\) U-type instructions. This instruction block consisting of a number of basic units is fetched together and then is distributed to PE modules. The least significant bit of instruction indicates the end of current instruction block when its value is 0.

#### A. Conditional instruction

Conditional- or C-type instructions contains operation (OP) code and trigger condition. OP code identifies the target operation while trigger condition defines when operation is ready to execute.

Six types of C-instructions are defined below, each operates on a slice of data block:
• **Memory Read** transforms data from external memory to on-board memory. It operates in two modes to accommodate for different data read patterns. Received data will be reorganized and distributed to three destination buffers corresponding to the feature map, kernel weighs and instructions, respectively.

• **Memory Write** sends the block of computational results back to external memory.

• **Data Fetch** performs data read from on-board feature map and kernel buffer, then feeds to computation engine. Its working pattern can be flexibly adjusted by placing constraint parameters on row and column address counters, read strides, and data reorganize modes.

• **Compute** controls all processing units (PEs). One PE computes the inner product of two 1D vectors of length \(N\), which is set to 16 in current micro-architecture implementation based on widely used CNNs families’ architectures. This sufficiently guarantees the design space exploration for different networks. Results of PEs can be summed up in different modes based on parameter setting.

• **Post Process** includes pooling, activation, data quantization, intermediate result addition and residual operations. Selected combination of before-mentioned operations are executed when post process is triggered.

• **Instruction Read** reads a new instruction block from instruction buffer and directs it to target operation modules.

Each aforementioned instruction introduced above leads one instruction unit. Instead of linking all the operation modules together in a fixed pipeline, we organize our operations in a dynamic pipeline fashion and each module is controlled by an individual instruction unit for more flexibility. For example, after one memory read is called for feature map loading, multiple data fetch and compute may be called to reuse loaded feature map data. Then at certain point during computation, memory read can be called again to replace kernel weights data (in the case where kernel size is large). When residual layer is encountered, memory read is called one extra time to load feature map data from short cut [16] for post process. Individual control of each module by instruction greatly simplifies overall hardware control frame, and enhances the architecture applicability to different network configurations.

To realize efficient instruction control, trigger condition is employed, so instruction is not executed immediately upon read. In CNN inference flow, each operation depends on previous operations based on different operating patterns, which have limited variations. We design a trigger condition list for individual operation, then modify trigger condition index (TCI) by instruction at runtime to set module initiation dependency. Moreover, using a dependency based execution strategy relaxes the order enforcement on instruction sequence. Cause memory related operations has the uncertainty in execution time due to extra refresh latency. The resulting system has a simple instruction update scheme. Several instructions executed at different time points can be put into the same instruction block and read at the same time. As shown in Fig. 3, after the first instruction read, initial TCI\(_0\) is set at \(t_0\) for all three operations. At \(t_1\), memory access is triggered then executes for \(t_1 - t_2\). Data fetch is triggered upon finishing memory operation and post process is triggered at \(t_3\). Next instruction read that updates TCI\(_1\) for Data fetch and post process can be performed at any time point between \(t_3\) and \(t_7\). Moreover, we store current TCI to avoid setting the same condition repeatedly when modules operate in one pattern consecutively (at time \(t_0\) and \(t_5\), memory access is triggered by the same TCI). This shortens the instruction sequence over \(10\times\).

### B. Unconditional instruction

Unconditional- or \(U\)-type instruction provides operation related parameters and is generated on an updating demand-based scheme, as parameters are stored to reduce the total length of instruction sequence.

Several \(U\)-type instructions combined can update the complete parameters list for one \(C\)-type operation. But in general, when operation pattern switches, only a subset of parameters are changed accordingly. Flexible combinations of \(U\)-type instructions can update necessary parameters with minimum instruction cost. We group parameters that are closely related to each other and have similar updating rates in one \(U\)-type instruction. This reduces the possibility of loading futile instruction sections, thus reducing both storage space and communication power.

### IV. MICRO-ARCHITECTURE

Another challenge in OPU design is overlay micro-architecture design. The overlay micro-architecture needs to incur as less control overhead as possible while maintaining easily runtime adjustable and functionality. We design our modules to be parameter customizable, and switch modes at runtime based on parameter registers that directly accepts parameters provided by instructions. The computation engine explores multiple level of parallelisms that generalize well among different kernel sizes. Moreover, CNN operations categorized into the same group are reorganized and combined (discussed in section III-A) so they can be accomplished by the same module to reduce overhead.

As shown in Fig. 4, the overlay micro-architecture can be decomposed into six main modules following the instruction architecture definition. Each module can be controlled by instruction to accomplish functionalities defined in section III-A. Besides, four storage buffers (i.e., input feature map...
buffer, kernel buffer, instruction buffer and output buffer) are placed to cache local data for fast access. With most of the control flow embedded in instruction, overlay only handles the computation of one sub-feature map block. If layer size is larger than the maximum block size allocated, the layer will be sliced into sub-blocks by compiler to fit into the hardware. The optimization of slicing scheme is discussed in section V.

A. Computation unit

How to utilize the same set of PE structures to accommodate for different layers is dominant in the design of CNN acceleration architecture. Conventional designs tend to explore parallelism within single 2D kernel, which is straightforward but comes with two disadvantages, i.e., complex feature map data management and poor generalization among various kernel sizes. As shown in Fig. 5(a), expanding a \( k_x \times k_y \) kernel sized window of feature map requires multiple data read from row and column directions within single clock cycle (step 1). This poses challenge on limited Block Ram bandwidth and generally requires extra complicated data reuse management (like line buffer) to accomplish. Furthermore, data management logic designed for one kernel size cannot be efficiently applied to another one. Similarly, PE architecture optimized for certain kernel size \( k_x \times k_y \) may not fit other sizes very well. That’s why many conventional FPGA designs optimize their architecture on popular \( 3 \times 3 \) kernel and perform the best only on networks with pure \( 3 \times 3 \) layers.

To address this issue, we explore higher level of parallelism and leave 2D kernel being computed sequentially. Fig. 5(b) explains how it works: At each clock cycle, a slice of input channel of depth \( IC_i \) with width and height as \( 1 \times 1 \) is read along with corresponding kernel elements. This fits natural data storage pattern and requires much smaller bandwidth. Parallelism is then implemented within input channel slice \( IC_i \) and output channel slice \( OC_i \). Fig. 5(c) further shows the computation process. For round 0 cycle 0, input feature channel slice from position \((0, 0)\) is read. Then we jump stride \( x \) (\( x = 2 \) is used as example here) and read position \((0, 2)\) in next cycle. Read operation continues until all pixels corresponding to kernel position \((0, 0)\) is fetched out and computed. Then we enter round 1 and read starting from position \((0, 1)\) to get all pixels corresponding to kernel position \((0, 1)\). To finish computing this data block of size \( IN^i \times IM^i \times IC^i \) with current slice of kernel with size \( k_{px} \times k_{py} \times IC_p \times OC_i \) in kernel buffer, \( k_{px} \times k_{py} \times (IC^i / IC_p) \times (OC^i / OC_p) \) rounds are needed.
One implementation of the computation unit is shown in Fig. 6. One single Multiplier Unit (MU) computes two $8 \times 8$ multiplication with one of the inputs kept the same. This constraint comes from the DSP decomposition implementation. Each PE consists of 16 MUs followed by an adder tree structure, thus each PE equals to 32 MACs. For one of our implementations of OPU, we implement 32 PEs within the computation unit, and an adder tree with switch is implemented to sum up results from different group sizes of PEs. The outputs number choices include $[64, 32, 16, 8, 4, 2]$. This allows the computation unit to flexibly fit into the needs of different combinations of input/output channels. For example, current implementation supports $1024$ multiplications, which is able to handle $6 \times 1$ in channel $\times$ out channel pairs: $[512, 2],[256, 4],[128, 8],[64, 16],[32, 32]$ and $[16, 64]$.

Our computation pattern guarantees the uniform data fetching pattern for any kernel size or stride. This greatly simplifies the data management stage before compute operation, and enables higher design frequency with less resource consumption. Moreover, we leverage both the input and output channel level parallelisms in a flexible way. This provides higher flexibility for resource utilization and promises reasonable generalization performance.

B. Data Fetch and Post-Process

The data fetch module reads feature map and kernel data from on-chip buffer, rearranges the data and then sends to the computation unit. As shown in Fig. 7, for input feature map read, FM ADDR GEN takes control parameters from instruction and produce feature map buffer read address at each clock cycle. The parameters include $[X_{\text{min}}, X_{\text{max}}, Y_{\text{min}}, Y_{\text{max}}, X_{\text{stride}}, Y_{\text{stride}}, X_{\text{size}}, Y_{\text{size}}]$. The feature map data read from buffer will be selected and copied by the FM REARR to fit the target computation pair of computation unit. For kernel weights, the bandwidth requirement can be as high as 8192 bit/cycle, since all the parallelisms are explored on the kernel side (input channel / output channel). We choose a computation pattern that shares the same set of kernel weights during one round of computation. Accordingly, kernel weights are only pre-loaded once from buffer for each round. We use $W PRE-LOAD ADDR GEN$ to generate weights address for buffer, and each weight takes 32 cycles to load. This loading time is overlapped with previous round of data fetch process. A pair of local shift registers using ping-pong structure $[W SHIFT REG SET 1, W SHIFT REG SET 2]$ is used to cache the weights.

The data post-process module performs data quantization, partial sum addition, pooling, activation and residual addition. Since pooling, activation and residual are only conducted once for one memory write, they are concatenated with the data memory write module to reduce extra on-chip data movement. As shown in Fig. 8, a data concatenation block is placed right after the input port to collect computation output (the output data number can be $[2, 4, 8, 16, 32, 64]$ based on different computation patterns) and formulate a 64-word long array for later process. A set of adders is used for bias addition and partial sum addition. Then the data quantization is achieved by data shift, cut and round modules that work based on parameters provided by instructions. When the complete output is ready, another part of the post process will be called and conduct pooling, activation and residual addition. The detailed corresponding module structure is shown in the right half of Fig. 8.

C. Memory management

Another crucial issue for CNN acceleration on FPGA is off-chip communication latency. Roof-line model [20] reveals the relationship between bandwidth utilization and computational roof performance. Bandwidth can easily be the bottleneck of performance. Therefore, we utilize a ping-pong structure based caching memory management system to hide off-chip...
communication latency. While one buffer’s data is being fetched, the other buffer can get refilled and updated, which maintains the maximum bandwidth utilization.

Another key point in memory management is data storage format in both local on-board buffer and external memory. For on-board storage format, data from the same channel slice is stored under the same address so that they can be fetched in one clock cycle. The limit of channel slice depth is set to be the width of on-chip buffer to guarantee such memory arrangement. Benefited from computation pattern, enough data bandwidth is provided and no extra memory latency is caused during computing stage.

Moreover, data storage format in external memory influences data transmission efficiency and complexity of memory access unit. We thereby employ a channel sliced scheme for feature map storage. As shown in Fig. 9, data from one channel slice with shape $1 \times 1 \times IC$ is stored adjacently. In this way, data can be streamed on-board in burst mode to fully utilize bandwidth. During the computation process, data chunk loading order is $1 \rightarrow 2 \rightarrow 3 \rightarrow 4$. Each chunk of data represents one sub-feature map block.

D. Irregular operation handling

ResNet [16] and GooglNet [15] exhibit excellent performance with reduced computational requirements compared with VGGNet [14] and AlexNet [18]. However, they also introduce new non-convolution operations that require extra attention.

Inception module. Channel-wise input concatenation is required for inception module. Taking advantage of our memory access pattern, we manipulate output memory addresses of preceding layers to place their outputs adjacently. Therefore, input concatenation can be achieved without any extra computational operation or latency cost. The arranged memory can be loaded for the next layer in burst mode for efficient off-chip memory transmission.

Residual module. The operation of residual module is matrix element-wise addition, which is not computational intensive but may cause extra stage of off-chip data communication. We embed the addition operation into post process pipeline and cover the additional block loading stage with computation time, so residual operation is executed with negligible latency increment. Fig. 10 illustrates how this implementation can realize all types of residual modules. As shown in Fig. 10(a), three kinds of residual paths are labeled out. Moreover, an activation module after element-wise addition is optional. We group element-wise addition, pooling and activation together and embed to previous convolution layer’s operation pipeline. Fig. 10(b) shows that an adder array is inserted at the first stage of post-process hardware implementation. All function modules are implemented with bypass logic. With the combination of different bypass and functions, all types of residual layer can be computed. Fig. 10(c) shows the content of ping-pong input buffer at different time steps. At time step 2, current block’s computation is at the final stage and uses data from input buffer A. Meanwhile, residual source is being loaded to input buffer B. At time step 3, both matrices for element-wise addition are ready and post process begins with no extra latency incurred.

V. Compiler

We develop a compiler to perform operation fusion, network slicing, and throughput optimization on input CNN configuration. There are two stages during the operation of compiler: Translation and Optimization, as shown in Fig. 11. Translation extracts necessary information from model definition files and reorganizes them into a uniform intermediate representation (IR) we defined. During this process, operation fusion (introduced in subsection V-A) is conducted to combine closely related operations. Another aspect of translation stage is data quantization and arrangement. Fast Data quantization is performed for Kernel Weights/Bias/Feature Maps with negligible accuracy loss. Generated dynamic fixed-point representation system is then merged into IR. Moreover, processed weights get re-arranged based on network slicing and optimization results from the optimization stage. Optimization stage parses Translation generated IR and explores the solution space of network slicing to maximize throughput. For this stage, an optimizing scheme for slicing is developed, which will be introduced in subsection V-D. In the end, the optimization solution is mapped to an instruction sequence.

A. Operation fusion

Conventional CNNs contain various types of layer that are connected from top to bottom to form a complete flow. In
order to avoid the off-chip memory communication between layers as to reduce computation requirements, operation fusion is employed to merge or concatenate related layer operations. Here we define two types of fusions, \( p\)-fusion and \( r\)-fusion. \( p\)-fusion indicates operation fusion that only contributes to off-chip memory access reduction, and \( r\)-fusion not only avoids communication latency but also reduces the total number of operations and inference time. Among them, \( p\)-fusion and \( r\)-fusion-I are hardware micro-architecture independent, while \( r\)-fusion-II is related to actual implementation scale.

\( p\)-fusion. We perform \( p\)-fusion to concatenate different layers into the same data stream pipeline, which prevents saving intermediate results back to off-chip memory. Convolution and Fully connected layers are major layers. Pooling, Padding, Activation, Residual and Output concatenation layers are treated as affiliated layers. Fig. 12 shows how this is applied on a set of layers. Originally off-chip memory access is required between each pair of adjacent layers, such as \{Padding,Convolution\}, \{Convolution,ReLU\} and \{ReLU,Pooling\}. After employing \( p\)-fusion \( 1 - 3\), memory access is reduced to only two times. The fused layers are called a layer group.

\( r\)-fusion. Fusion that decreases the amount of hardware arithmetic computations by merging adjacent operations is also conducted. There are mainly two kinds of \( r\)-fusions.

\( r\)-fusion-I is Batch normalization [21] elimination. A batch normalization operation can be represented by:

\[
y = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\]

where \( \gamma, \mu, \sigma, \beta \) and \( \epsilon \) are fixed values during inference time. \( x \) and \( y \) represent input and output matrix of one channel. Eq. (1) is essentially a linear function of input matrix, which can be merged to preceding or succeeding convolutional layers based on network structure. This merging avoids the separate computation of Batch normalization, thus reducing the inference time. The merged convolutional layer works with modified weights and bias that incorporate Batch normalization coefficients.

\( r\)-fusion-II is input sharing. In many cases, layers that share identical inputs can be computed by the same round to fully utilize computational resources and reduce memory communication time. As shown in Fig. 13(a), Conv1-Conv4 share the same input layer, but the numbers of their output channels are either small or cannot be evenly divided by available computing resources. Therefore, it takes 5 rounds to compute the four layers sequentially with relatively low runtime efficiency of PEs. To make use of idle resources, we add an optimization step into compiler that identifies input sharing layers and reassembles them based on their individual resource consumption, as shown in Fig. 13(b). Fig. 14 shows the improvement of Computation resource runtime efficiency after applying input sharing \( r\)-fusion to different inception modules. Different groups indicate different scales of computing resource, which increase from left to right. More resources available means higher possibility of resource idling, thus providing more room for input sharing optimization.

B. Data Quantization

It has been proven that CNNs are robust against precision reduction [6] [22]. To reduce memory footprint and save computational resources, we use limited precision fixed-point values during computation, data type can be set to fixed-4/8/16 bit based on platform constraint and network accuracy requirements. Taking general network precision redundancy and hardware architecture complexity into consideration, 8 bit
is chosen as our data quantization standard for both feature map and kernel weights. We employ a fast yet effective stationary quantization method. Dynamic quantization scheme (similar as [6] [23]) has been employed for better accuracy. In such scheme, each layer’s kernel weights and feature maps have their own range for higher precision. The process of finding the best range for each trunk of data is described as follows:

$$\text{argmin}_{\text{floc}} \sum (\text{float} - \text{fix}(\text{floc}))^2, \quad (2)$$

where \text{float} is the original single precision representation of kernel weights or the feature maps, and \text{fix}(\text{floc}) is the value after the \text{float} is cut into fixed-point based on certain fraction length \text{floc}. Table I shows the quantization accuracy evaluation of 9 experiment networks, where the accuracy loss is within 1% on average.

### C. Intermediate representation (IR)

The IR is defined based on layers after the \text{p-fusion}, which we call layer groups, and each layer group gets represented by a set of coefficients, as shown in Table II. The representation is easy to expand by adding more terms to accommodate for new network features.

IR contains all the operations included in the current layer groups. Layer index is the sequential number assigned to each conventional layer. Single layer group may have multiple layer index for input in the case of inception module, where various previous outputted feature maps (FMs) are concatenated to form the input. Meanwhile, multiple intermediate FMs generated during layer group computation can be used as other layer groups’ residual or normal input sources. Dump out location specifics which set of FM to dump out to the off-chip memories. Element-wise operation location is used to flexibly adjust element-wise residual addition position in layer groups, so any combination of pool/activation/element-wise addition can be accommodated.

### D. Slicing and allocation

In this stage, IR generated by Translation stage is parsed to get network architecture after operation fusion and quantization. Then, an automatic optimizer is applied to explore optimal slicing scheme that maps current architecture to overlay with maximum throughput.

Suppose an individual layer \(i\) is sliced into \(p^i\) blocks. One slicing scheme for layer \(i\) can be defined as a vector of parameter groups \(\vec{\mathcal{P}}^i: [(IN^i_{j_n}, IM^i_{j_m}, IC^i_{j_c}, OC^i_{j_c})]_{j_n \in [0, p^i_n)}, j_m \in [0, p^i_m), j_c \in [0, p^i_c))]. Each parameter group \((IN^i_{j_n}, IM^i_{j_m}, IC^i_{j_c}, OC^i_{j_c})\) decides one round of overlay computation, where \(IN^i_{j_n}, IM^i_{j_m}, IC^i_{j_c}\), and \(OC^i_{j_c}\) represent input block width, height, input depth, and output depth, respectively. Then we have:

$$N^i_{in} = \sum_{j_n} p^i_n IN^i_{j_n}, \quad M^i_{in} = \sum_{j_m} IM^i_{j_m},$$

$$C^i_{in} = \sum_{j_c} \frac{IC^i_{j_c}}{\#C^i_{out} \text{ slices}}, \quad C^i_{out} = \sum_{j_c} \frac{IC^i_{j_c}}{\#C^i_{in} \text{ slices}}, \quad (3)$$

and

$$p^i = p^i_n \times p^i_m \times p^i_c. \quad (4)$$

The inference latency \(L^i_j\) of one round computation can be represented by

$$L^i_j = (k^i_x \times k^i_y + 2) \times ON^i_{j_n} \times OM^i_{j_m}, j \in [0, p^i)$$

where \(ON^i_{j_n}\) and \(OM^i_{j_m}\) indicate the output block width and height, and the +2 term accounts for initial memory read and final memory write latency. Then we define the throughput of inferencing a network as

$$T = \frac{\sum_{i} \sum_{j} N^i_{out} \times M^i_{out} \times (2 \times C^i_{in} \times k^i_x \times k^i_y - 1) \times C^i_{out}}{\sum_{i} \sum_{j} L^i_j}, \quad (6)$$

where \(m\) represents the number of layers after fusion.

Therefore, the target of slicing optimization can be represented as

$$\max_{p} T$$

s.t. \(IN^i_{j_n} \times IM^i_{j_m} \leq \text{depth}_{\text{thres}}\)

$$\text{ceil}(\frac{IC^i_{j_c}}{V_{PE}}) \times OC^i_{j_c} \leq N_{PE}\)

$$IC^i_{j_c} \leq \text{width}_{\text{thres}}, \quad (7)$$

where \text{depth}_{\text{thres}} and \text{width}_{\text{thres}} stand for on-chip BRAM depth and width limit, respectively. The solution obtained from Eq. (7) also maximums \(\alpha\).

### E. Extra Efficiency improvements

In most cases, combining two levels of parallelism \(IC^i \times OC^i\) utilizes a large portion of PE resources. However, rare cases exist where \(C^i_{in}\) and \(C^i_{out}\) are too small and offer limited parallelism. This usually happens in the first layer, where \(C^i_{in}\) is fixed to 3 and \(C^i_{out}\) is usually below 64. To accommodate for this situation, our proposed compiler further rearranges input feature maps. Pixels in kernel window are moved to fill the channel dimension to increase the parallelism availability channel-wise, as shown in Fig. 15. This rearrangement is able to gain over 10× speedup for the first layer computation on average.

### VI. Experiment Results

We implement three OPU versions with different MAC numbers on Xilinx XC7K325T FPGA and XC7Z100 FPGA. Corresponding resource utilization is shown in Table III. For OPU1024, all the MACs are implemented with DSP. For
Table I: 8-bit quantization evaluation for different networks.

<table>
<thead>
<tr>
<th>Network Type</th>
<th>VGG16 Float 32 bit</th>
<th>VGG19 Fixed 8 bit</th>
<th>Inception V1 Resnet v1 90%</th>
<th>Inception V2 Resnet V1 92.9%</th>
<th>Inception V3 resnet V2 93.5%</th>
<th>YOLO V2 85.93%</th>
<th>Tiny YOLO 89.8%</th>
</tr>
</thead>
<tbody>
<tr>
<td>Classification</td>
<td>89.8%</td>
<td>85.2%</td>
<td>87.29%</td>
<td>90.30%</td>
<td>95.15%</td>
<td>92.9%</td>
<td>93.7%</td>
</tr>
<tr>
<td>Detection</td>
<td>89.4%</td>
<td>84.3%</td>
<td>85.49%</td>
<td>89.87%</td>
<td>91.41%</td>
<td>92.3%</td>
<td>93.2%</td>
</tr>
</tbody>
</table>

- a: Reported accuracy are top-5 accuracy evaluated on Imagenet ILSVRC2012 validation set
- b: Reported value are mAP
- c: Evaluated on a private subway x-ray dataset
- d: Evaluated on a private traffic view dataset

Table II: IR content for single layer group.

<table>
<thead>
<tr>
<th>Layer Type</th>
<th>Input idx</th>
<th>Output idx</th>
<th>Input size</th>
<th>Output size</th>
<th>Act. info type</th>
<th>Pool info</th>
<th>Padding info</th>
<th>Activation info</th>
<th>Dump out loc</th>
<th>Element-wise op loc</th>
<th>Quantization info</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fully connected</td>
<td>(0)</td>
<td>Conv</td>
<td>(1)</td>
<td>extra pool</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Fully connected</td>
<td>(0)</td>
<td>Conv</td>
<td>(1)</td>
<td>extra pool</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Figure 15: Input Image Rearrangement: channel dimension filling.

OPU2048 and OPU4096, part of the MACs are implemented with LUT since the number of DSPs are not enough. A PC with Xeon 5600 CPU is used for our compiler program. Result interface and device are shown in Fig. 16.

Figure 16: Evaluation board and runtime results for classification network VGG16 and detection network YOLO.

A. Network Description

To evaluate the performance of OPU, 9 CNNs of different architectures are mapped, including YOLOv2, tiny-YOLO, VGG16, VGG19, InceptionV1/v2/v3, Resnetv1-50 and Resnetv1-101. Among them YOLOV2 and tiny-YOLO are object detection networks and the rest are image classification networks. Detailed network architectures are shown in Table IV. Different Kernel sizes from square kernel (1 × 1, 3 × 3, 5 × 5, 7 × 7) to sliced kernel (1 × 7, 7 × 1) are used, in addition to various pooling (1 × 2, 2 × 2, 3 × 1, 3 × 2, 7 × 1, 7 × 2, 8 × 2) sizes and activation types (ReLU and Leaky ReLU). Irregular operations such as inception module and residual module are included as well. All networks are quantized to 8 bit precision for both kernel weights and feature map to achieve higher efficiency.

B. Runtime MAC Efficiency (RME)

OPU is designed to be a domain specific processor for a variety of CNNs. Therefore, Runtime MAC Efficiency (RME) for different CNN’s is an important metric for both hardware efficiency and performance. RME is calculated by the actual throughput achieved during runtime, divided by the theoretical roof throughput (TTR) of design. The TTR can be calculated as follows,

\[ TTR = MAC_{num} \times 2 \times f, \]  

where \( MAC_{num} \) indicates the number of MACs and \( f \) represents design frequency. For instance, in our implementation of OPU1024 running with 200 MHZ, 1024 MAC units are utilized for PE array. Therefore, we have \( TTR_{opu1024} = 1024 \times 2 \times 200 \times 10^6 = 409.6 \text{ GOPS} \). The actual throughput achieved by running the convolution part of VGG network on the OPU is 397 GOPS, so the RME for VGG (CONV) is \( \frac{397}{409.6} = 97.79\% \). High RME indicates all computational resources of hardware are well-utilized and processor is efficient for the current network. Note that for fully connected (FC) layers, a high RME is normally difficult to be obtained under non-batch mode due to large number of weights and relatively small computational requirements. Therefore, for networks that contain FC layers, we report the overall RME under batch mode, and frames per second under both non-batch and batch mode for better evaluation. Processor are running under 200 MHZ. As shown in Table V, on average an overall RME of 91.44% on average is achieved for all test networks. This is even higher than the start-of-the-art customized implementations (will be discussed in VI-C).

C. Comparison with Existing FPGA Accelerators

In this subsection, we compare the performance of OPU with auto-compiler generated network-specific accelerators. Table VI lists out customized accelerators designed for networks VGG16 or YOLO, which are implemented on FPGAs.

Table III: FPGA Resource Utilization.

<table>
<thead>
<tr>
<th>Network Type</th>
<th>LUT</th>
<th>FF</th>
<th>BRAM</th>
<th>DSP</th>
</tr>
</thead>
<tbody>
<tr>
<td>OPU1024</td>
<td>XC7K325T</td>
<td>94763(46.50%)</td>
<td>150848(37.01%)</td>
<td>165(37.08%)</td>
</tr>
<tr>
<td>OPU2048</td>
<td>XC7K325T</td>
<td>129927(63.75%)</td>
<td>233996(57.41%)</td>
<td>165(37.08%)</td>
</tr>
<tr>
<td>OPU4096</td>
<td>XC7Z1000</td>
<td>154516(55.70%)</td>
<td>337651(60.86%)</td>
<td>337444(64.43%)</td>
</tr>
</tbody>
</table>
Table IV: Network Information

<table>
<thead>
<tr>
<th>Network</th>
<th>YOLOv2</th>
<th>tiny-YOLO</th>
<th>VGG16</th>
<th>VGG19</th>
<th>InceptionV1</th>
<th>InceptionV2</th>
<th>InceptionV3</th>
<th>Resnet-50</th>
<th>Resnet-101</th>
</tr>
</thead>
<tbody>
<tr>
<td>Input size</td>
<td>608 × 608</td>
<td>416 × 416</td>
<td>224 × 224</td>
<td>224 × 224</td>
<td>224 × 224</td>
<td>224 × 224</td>
<td>224 × 224</td>
<td>224 × 224</td>
<td>224 × 224</td>
</tr>
<tr>
<td>Kernel size</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
<td>1 × 1, 3 × 3, 5 × 5</td>
</tr>
<tr>
<td>Conv layer number</td>
<td>21</td>
<td>9</td>
<td>13</td>
<td>16</td>
<td>57</td>
<td>69</td>
<td>57</td>
<td>90</td>
<td>53</td>
</tr>
<tr>
<td>Activation Type</td>
<td>Leaky ReLU</td>
<td>Leaky ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
<td>ReLU</td>
</tr>
<tr>
<td>Operations (GOP)</td>
<td>54.07</td>
<td>5.36</td>
<td>30.92</td>
<td>39.24</td>
<td>2.99</td>
<td>3.83</td>
<td>11.25</td>
<td>6.65</td>
<td>12.65</td>
</tr>
</tbody>
</table>

Table V: RME of OPU1024 for Different Networks.

<table>
<thead>
<tr>
<th>Network</th>
<th>YOLOv2</th>
<th>tiny-YOLO</th>
<th>VGG16</th>
<th>VGG19</th>
<th>InceptionV1</th>
<th>InceptionV2</th>
<th>InceptionV3</th>
<th>Resnet-50</th>
<th>Resnet-101</th>
</tr>
</thead>
<tbody>
<tr>
<td>Frequency (MHZ)</td>
<td>200</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RME</td>
<td>95.51%</td>
<td>89.21%</td>
<td>97.18% (B)</td>
<td>97.30% (B)</td>
<td>90.38% (B)</td>
<td>90.48% (B)</td>
<td>91.10 (B)%</td>
<td>84.48%</td>
<td>86.92%</td>
</tr>
<tr>
<td>Conv RME</td>
<td>95.51%</td>
<td>89.21%</td>
<td>97.01%</td>
<td>97.75%</td>
<td>90.45%</td>
<td>90.75%</td>
<td>90.90%</td>
<td>84.48%</td>
<td>86.92%</td>
</tr>
<tr>
<td>Frame/s</td>
<td>7.23</td>
<td>68.32</td>
<td>12.18 (B) / 11.28</td>
<td>9.72 (B) / 9.4</td>
<td>112.48 (B) / 104.47</td>
<td>89.77 (B) / 84.60</td>
<td>30.01 (B) / 27.28</td>
<td>54.36</td>
<td>27.05</td>
</tr>
</tbody>
</table>

Table VI: Comparison with customized accelerators (VGG and YOLO).

<table>
<thead>
<tr>
<th>Network</th>
<th>YOLOv2</th>
<th>tiny-YOLO</th>
<th>VGG16</th>
<th>VGG19</th>
<th>InceptionV1</th>
<th>InceptionV2</th>
<th>InceptionV3</th>
<th>Resnet-50</th>
<th>Resnet-101</th>
</tr>
</thead>
<tbody>
<tr>
<td>Device</td>
<td>XCV7Z045</td>
<td>XCV7Z045</td>
<td>XCV7K325T</td>
<td>XCV7K325T</td>
<td>XCV7Z045</td>
<td>XCV7K325T</td>
<td>XCV7K325T</td>
<td>XCV7K325T</td>
<td>XCV7K325T</td>
</tr>
<tr>
<td>DSP Utilization (bit)</td>
<td>727 (900)</td>
<td>780 (900)</td>
<td>824 (900)</td>
<td>516 (840)</td>
<td>190 (220)</td>
<td>516 (840)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Data format (bit)</td>
<td>16</td>
<td>16</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>200</td>
<td>200</td>
<td>200</td>
<td>200</td>
</tr>
<tr>
<td>Frequency (MHZ)</td>
<td>120</td>
<td>150</td>
<td>100</td>
<td>200</td>
<td>214</td>
<td>200</td>
<td>200</td>
<td>200</td>
<td>200</td>
</tr>
<tr>
<td>TTR (GOPS)</td>
<td>174</td>
<td>234</td>
<td>329</td>
<td>412</td>
<td>162</td>
<td>412</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Throughput (GOPS)</td>
<td>118 / 137 (C)</td>
<td>137 / 188 (C)</td>
<td>230</td>
<td>354 / 397 (C)</td>
<td>62.9</td>
<td>366</td>
<td>391</td>
<td></td>
<td></td>
</tr>
<tr>
<td>RME</td>
<td>67% / 78 (C)%</td>
<td>58% / 79% (C)</td>
<td>69%</td>
<td>86% / 97% (C)</td>
<td>39%</td>
<td>89%</td>
<td>95%</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

C: Convolutional layer only

of similar scales for fair comparison. We use throughput and RME as the comparison criteria. The number of MACs and design frequency decide the TTR that can be achieved by certain design. RME shows how well the accelerator utilizes all its available resources actually, which can be directly transformed to real throughput based on TTR. When estimating the TTR of reference design, we also take the data format into consideration. One 25bit × 18bit DSP can be decomposed to handle two 8bit × 8bit multiplications, while it can only handle one 16bit × 16bit multiplication. Therefore, for the same number of DSPs, 8bit system has twice overall computational capability compared with 16bit system. A coefficient α is used to account for the influence of data format. The RME of different designs can be computed as follows,

\[ RME = \frac{T}{\alpha \times DSP_{num} \times f \times 2} \]  

where \( T \) represents the real throughput achieved by the design, \( DSP_{num} \) indicates the total number of DSP utilized. Here \( \alpha = 1 \) for 16bit system, and \( \alpha = 2 \) for 8bit system. The \( \times 2 \) term is used to compute both multiplication and addition operation. Compiling VGG and YOLO for OPU takes much less time compared to generating one network-specific accelerator by automatic compile. Yet, OPU achieves better RME compared with automatically compiled network-specific accelerators. For example, OPU has 86% RME while the RME for other VGG-specific accelerators ranges from 58% to 69% in table VI.

Direct comparison of throughput without taking the number of utilized MACs into consideration is not fair, so we scale our design to match the number of MACs in different reference designs. As shown in Fig. 17, the blue dots represent the simulated real performance of OPU implemented with different number of MACs running VGG16. Increasing of the MAC number leads to the improvement of the real throughput. But their relationship is not linear, as MAC number that can not evenly divide the available parallelism of network very well may have low RME. In such cases, extra MACs come with no additional throughput (as indicated by the horizontal lines formed by the blue dots). Purple dots indicate the throughput of reference designs. Most of them locate in the area below OPU “line”, which means the similar number of MAC provides lower throughput compared with OPU. Among all the reference designs, implementations in [11] and [25] has higher performance compared with OPU. The performance of [11] is achieved by its high design frequency (231 MHZ), which is enabled by the systolic PE matrix architecture that shares inputs with neighboring PEs to provide simple routing. However, this architecture is easily constrained by the specific network structure, thus requires the change of PE matrix shape for every new network to achieve claimed performance. (PE matrix shape of [11, 14, 8] and [8, 19, 8] are used for Alexnet and VGG separately [11].) Authors of [25] create their own version of pruned VGG to improve performance. Moreover, they use layer-pipe-lined design which accelerate several layers of one network independently. It requires large modifications of FPGA implementation for different networks.
([#DSP, #BRAM] resource of [680, 542] and [808, 303] are used for Alexnet and VGG, respectively).

Normally the design parameters of customized accelerators are tuned specifically to fit the configuration of target network. Therefore, they are able to gain better performance. For example, an accelerator design for VGG can be specifically tuned to fit only the $3 \times 3$ kernel size and cannot perform well with YOLO where $1 \times 1$ kernel size is included. Moreover, the control flow and data buffer need to be changed to fit different target network. We summarize our performance advantages over customized accelerators into two points. First, input and output channel parallelism has less variation compared with kernel/input feature map pixel parallelism. Combined with the first layer re-arrangement, it can provide enough parallelism to fully utilize the computation resources and gain high throughput. For example, if only paralleling the kernel and input channel, the computation resource can easily fall under-utilized when kernel size is too small. Second, our PE array is designed to fit various types of [input, output] pairs, which cover the majority of the configurations in CNNs. Therefore, we can achieve higher throughput than designs with fixed number of input and output channels.

![Figure 17: Performance comparison of OPU implemented on similar number of MACs with reference designs.](image)

**D. Power comparison**

We compare the power efficiency of OPU with other FPGA designs as well as GPU and CPU. Table VII lists out the comparison results of different hardware platform running VGG16. We measure the power consumption of OPU using a PN2000 electricity usage monitor. The reported power includes static and dynamic power consumed by the whole board. FPGA designs generally have better power efficiency compared with CPU, around 2x better power efficiency compared with GPU with batch = 1 and no obvious advantage compared with GPU with larger batch in most cases [10] [22] [26]. Considering the fast development speed of GPU devices, we include the comparison results of three GPUs of different technologies. GTX 780 uses 28nm (same as OPU), GTX 1080 and Titan Xp running with batch = 1, the power efficiency of OPU1024 is 2.13x and 5.35x better. For Titan Xp running with batch = 64, OPU1024 is 1.25x better. For Jetson Tx2 running with batch = 16, OPU1024 shows 89% power efficiency. However, big batch size indicates over 6x larger latency compared with batch = 1 mode. For edge computing targeted device the real-time batch = 1 mode with short latency performance is more important.

**E. Case study of real-time cascaded networks**

To further evaluate real-time performance on cascaded networks of OPU, we implement the task on OPU1024 to recognize car license plate from street-view pictures. It is composed of three networks: car-YOLO (trained based on YOLO for car detection), plate-tiny-YOLO (trained based on YOLO for plate detection) and a character recognition network (cr-network). For single picture input, the car-YOLO network run to detect the plate numbers or characters for each car. YOLO for car detection), plate-tiny-YOLO (trained based on YOLO for plate detection) and a character recognition network (cr-network). For single picture input, the car-YOLO network runs first to label all cars. Then plate-tiny-YOLO and cr-network run to detect the plate numbers or characters for each car.

As shown in Table VIII, we compare the performance for OPU1024 and Jetson Tx2. Tx2 is running with batch = 5 and the speed data is computed by total time between input to output divided by 5. It can be seen that OPU is faster in executing all three networks compared to Jetson. Overall, OPU is 2.9x faster than Jetson. With similar computation capability, the higher speed achieved by OPU comes from the higher PE utilization rate enabled by our domain specific architecture and compiler.

**VII. CONCLUSIONS AND DISCUSSIONS**

In this paper we propose OPU, a domain-specific FPGA-based overlay processor system targeting at general CNN
acceleration. It is software-programmable with short compilation time. We develop a set of instructions with granularity optimized for PE (processing element) efficiency. Our compiler performs operation fusion to reduce inference time, and conducts network slicing to maximize overall throughput and RME. OPU exhibits high flexibility and generality for a variety of CNNs with average RME around 91%. It has a higher RME on VGG16 and YOLO networks compared with state-of-the-art network-specific auto-compiler generated accelerators. For power consumption, OPU of different scales show 1.2× to 5.35× better power efficiency compared with GPU (batch = 1, batch = 64) and other FPGA design. Moreover, for cascaded CNN networks to detect car license plate, OPU is 2.9× faster compared with edge computing GPU Jetson Tx2 with similar amount of computing resources.

Our future work will develop better micro-architecture and more compiler optimization mechanisms, extend OPU for both CNN and RNN, and also apply and optimize OPU for different deep learning applications, particularly for three dimensional medical images.

REFERENCES