Lookup Table Merging for FPGA Technology Mapping

Higher logic density is a key goal in Field Programmable Gate Array (FPGA) design. Higher logic density lowers the cost of a circuit, and shorter wires reduce the power it requires. Technology mapping maps technology-independent logic gates into a lookup table (LUT) netlist. A major problem with the 6-input LUTs used in current FPGAs is that the LUTs are underutilized. Modern FPGA architectures have good support for dual-output LUTs. Two small LUTs can be merged into one dual-output LUT if their total number of inputs is not greater than the input limit of the LUT, thereby reducing the area, that is, the number of LUTs. In this paper, LUTs are merged in the technology mapping stage, and a new method of merging single-output LUTs into dual-output LUTs is proposed. Experimental results show that the proposed algorithm reduces the number of LUTs by 10.21% on average without worsening delay.


INTRODUCTION
A Field Programmable Gate Array (FPGA) is a semi-custom circuit developed from earlier programmable devices. Compared with traditional application-specific integrated circuit (ASIC) chips, FPGAs offer higher flexibility and reconfigurability. Technology mapping is the front end of the FPGA CAD flow; it maps technology-independent logic gate circuits into netlists of FPGA basic logic units, such as lookup table (LUT) netlists. LUT-based FPGAs consist of K-input logic blocks, and any logic function with up to K inputs can be implemented with a K-input lookup table (K-LUT).
Cut enumeration [1] [2] is the classic approach to technology mapping. During cut enumeration, the subject graph is traversed in topological order and all cuts are computed for each node. The trivial cut of a node consists of the node itself, while a non-trivial cut is a set of nodes in its fan-in cone such that every path from the primary inputs (PIs) to the node passes through one of them. During mapping, one K-feasible cut is selected as the representative cut of each node, and the set of mapped nodes is updated repeatedly until the representative cuts cover all nodes of the subject graph except the PIs.
The main goal of technology mapping is to minimize area and delay, where the area is the number of LUTs and the delay is the depth of the LUT netlist. Mishchenko et al. [3] proposed a technology mapping algorithm based on priority cuts, which avoids enumerating many cuts by computing only a small number of good K-feasible cuts and applies different heuristics to order the cuts at different stages of the mapping. Vasilyev et al. [4] used simulated annealing to perform logic resynthesis and reduced the number of lookup tables. Liu and Zhang [5] used parallel iteration to accelerate technology mapping. These cut-based methods generate single-output LUTs. In modern FPGA technology mapping, to better optimize the delay and area of the circuit [6], many large LUTs (such as 6-input LUTs or LUTs with more inputs) can be "split" into two small LUTs; such a large LUT is a dual-output LUT. Unlike traditional LUTs, a dual-output LUT can either be "split" into two smaller sub-LUTs [7] or operate as one large single-output LUT. At the cost of some additional area, this provides a delay advantage.
Existing tools use technology mapping to generate single-output LUTs and then combine the single-output LUTs into dual-output LUTs. Previous work has optimized the merging of dual-output LUTs to some extent: WireMap [8] uses an edge-recovery heuristic to increase the LUT merging rate, and Machado et al. [9] used KL-cuts for multi-output mapping, but the merging efficiency is still low. In this paper, we propose new methods for merging the single-output LUTs produced by mapping into dual-output LUTs. For commercial FPGAs, our method reduces the number of LUTs by 10.21% on average. The main contributions are as follows. We propose the concept of LUT similarity and cast the problem of merging LUTs as a weighted graph matching problem, using the similarity of LUTs as the edge weight and applying the bipartite graph maximum weight matching algorithm to match single-output LUTs. We then use a weighted graph matching algorithm based on a greedy strategy to optimize the matching of the LUTs. In addition, a heuristic merging method based on hierarchical information is proposed for LUTs without similarity, which improves the merging rate and reduces the number of LUTs.

Definition Introduction
A Boolean network is a directed acyclic graph (DAG) of logic gates, where each node corresponds to a logic gate and each directed edge corresponds to a wire connecting two logic gates. Network, Boolean network, and circuit mean the same thing in this paper.
Each node m in the DAG has zero or more fan-ins and fan-outs, where a fan-in is a node driving node m and a fan-out is a node driven by node m. Nodes without fan-ins are primary inputs (PIs), and nodes without fan-outs are primary outputs (POs). For a sequential network, to facilitate the optimization and mapping of the combinational logic, the inputs and outputs of the registers in the network are also treated as primary outputs and primary inputs, respectively.
If the number of fan-ins of every node in a network is not greater than a certain number K, the network is K-bounded. An AND-INV graph (AIG) consists of two-input AND gates and NOT gates. The subject graph used for technology mapping is a K-bounded network, and an AIG can represent any network of combinational logic. The subject graph used in the rest of this paper is an AIG.
A cut C of node m (called the root node) is a set of nodes in the network, called leaf nodes, such that every path from the PIs to node m passes through at least one leaf node. The trivial cut consists only of the node m itself, while a non-trivial cut covers all nodes between the root node and the leaf nodes, including the root node but excluding the leaf nodes. A cut C is called a K-feasible cut if the number of leaf nodes in C is not greater than K, in which case it can be implemented by a K-LUT. If a cut C contains all the nodes of another cut of the same root, C is called a dominated cut.
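As a minimal illustration of these definitions, the following Python sketch checks whether a set of leaves forms a cut of a root node and whether it is K-feasible. The small network, the node names, and the helper names `is_cut` and `is_k_feasible` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Set


def is_cut(fanins: Dict[str, List[str]], root: str, leaves: Set[str]) -> bool:
    """Return True if every path from a primary input to `root` hits a leaf."""
    stack = [root]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node in leaves or node in seen:
            continue  # a leaf blocks the path; visited nodes are skipped
        seen.add(node)
        if not fanins.get(node):
            return False  # reached a primary input without crossing a leaf
        stack.extend(fanins[node])
    return True


def is_k_feasible(leaves: Set[str], k: int) -> bool:
    """A cut is K-feasible when it has at most K leaves."""
    return len(leaves) <= k


# Hypothetical network: f = AND(a, b), g = AND(c, d), i = AND(f, g).
# Nodes that do not appear as keys (a, b, c, d) are primary inputs.
fanins = {"f": ["a", "b"], "g": ["c", "d"], "i": ["f", "g"]}

print(is_cut(fanins, "i", {"a", "b", "g"}))  # every path to i crosses a leaf
print(is_cut(fanins, "i", {"a", "g"}))       # the path b -> f -> i is uncovered
```

The backward traversal stops at leaves, so reaching a primary input that is not itself a leaf proves that some path bypasses the candidate cut.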

Cut-Based Technology Mapping
The cut enumeration algorithm comes from [10] and [11], and the subject graph used is an AIG. For convenience, let A and B be two sets of cuts and define A ◊ B as:

$$A \diamond B = \{\, u \cup v \mid u \in A,\ v \in B,\ |u \cup v| \le K \,\}$$

where u is a cut in A, v is a cut in B, and the total number of inputs of u and v cannot be greater than K. Let Φ(m) be the set of cuts of node m whose size is not greater than K. If node m is an AND gate, let nodes m1 and m2 be its fan-in nodes. The cut set of node m is then computed from the cut sets of its fan-ins:

$$\Phi(m) = \begin{cases} \{\{m\}\}, & m \in \mathrm{PI} \\ \{\{m\}\} \cup \bigl(\Phi(m_1) \diamond \Phi(m_2)\bigr), & \text{otherwise} \end{cases}$$

That is, if node m is a PI, its cut set contains only the trivial cut; otherwise, it is the union of the trivial cut of the node and the merged cut sets of its two fan-in nodes. From the PIs to the POs, the recurrence is applied to all nodes in topological order to compute their K-feasible cuts. The cut set of each AND gate is obtained by merging the cut sets of its two children and adding the trivial cut. While cut sets are merged, dominated cuts and duplicate cuts must be removed. Removing them before computing the cuts of the next node effectively reduces the computation of the cut sets without affecting the quality of the mapping.
Figure 1 shows the cut enumeration process for a small circuit. Cut enumeration computes all cuts of every node of the circuit from the bottom up and removes unsuitable cuts at each node as it goes. For example, the cut {a, b, f, c, d} of node i is a superset of the cut {a, b, c, d}; that is, the cut {a, b, f, c, d} is a dominated cut, and it is removed after all cuts of node i have been computed.
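The merge operator and the cut-set recurrence can be sketched in Python as follows. This is an illustrative sketch under simple assumptions, not the implementation of [10] or [11]: cuts are stored as frozensets of leaf names, AND gates are given in topological order, and the function names and the three-gate example network are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Cut = FrozenSet[str]


def merge_cut_sets(a: Set[Cut], b: Set[Cut], k: int) -> Set[Cut]:
    """The A <> B operator: pairwise unions that have at most k leaves."""
    return {u | v for u, v in product(a, b) if len(u | v) <= k}


def remove_dominated(cuts: Set[Cut]) -> Set[Cut]:
    """Drop every cut that strictly contains another cut of the same node."""
    return {c for c in cuts if not any(other < c for other in cuts)}


def enumerate_cuts(
    and_gates: List[Tuple[str, str, str]], pis: List[str], k: int
) -> Dict[str, Set[Cut]]:
    """Compute K-feasible cuts; and_gates holds (node, fanin1, fanin2)."""
    # A primary input has only its trivial cut.
    cuts: Dict[str, Set[Cut]] = {p: {frozenset({p})} for p in pis}
    for node, m1, m2 in and_gates:  # assumed to be in topological order
        merged = merge_cut_sets(cuts[m1], cuts[m2], k)
        # Storing cuts in a set removes duplicates automatically.
        cuts[node] = remove_dominated(merged) | {frozenset({node})}
    return cuts


# Hypothetical network with reconvergence: f = AND(a, b), h = AND(f, a),
# j = AND(h, f). Merging yields {h, a, b} and {a, b, f} for node j, both of
# which contain the smaller cut {a, b} and are therefore dominated.
gates = [("f", "a", "b"), ("h", "f", "a"), ("j", "h", "f")]
cuts = enumerate_cuts(gates, ["a", "b"], k=4)
for cut in sorted(cuts["j"], key=sorted):
    print(sorted(cut))
```

Filtering dominated cuts at each node keeps the cut sets passed to later nodes small, which is the source of the savings described above.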

Figure 1. Cut enumeration example
After cut enumeration, depth-oriented mapping is used to minimize the mapping depth, area recovery is then performed by two heuristic algorithms to reduce the number of LUTs, and finally the mapping result is generated.
The mapping result consists of single-output LUTs. Modern commercial FPGAs have good support for dual-output LUTs. If a better LUT merging method is used, the number of LUTs can be further reduced and the logic density of the FPGA can be increased.

Similarity Principle
Figure 2 shows the models of a single-output LUT and a dual-output LUT. Figure 2 (a) shows a LUT model with three inputs and one output; the number of inputs generally does not exceed K (the maximum number of inputs, 6 in this paper). Figure 2 (b) shows two single-output LUTs merged into one dual-output LUT. Each of the two single-output LUTs has 3 inputs, and they share one input. After they are combined, the dual-output LUT has 5 inputs and 2 outputs.

Figure 2. Single-output and dual-output LUT model
To reduce the impact on the circuit of merging LUTs into dual-output LUTs, this paper proposes the principle of similarity. The similarity S is calculated from three quantities: the number of inputs shared by the two single-output LUTs, the total number of inputs of single-output LUT1, and the total number of inputs of single-output LUT2. The larger the similarity S is, the more similar LUT1 and LUT2 are in structure and the less impact merging the two has on the circuit.
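The following Python sketch illustrates the quantities involved: the number of inputs of a merged LUT, a merge feasibility check, and a similarity score. Two points are assumptions made only for this example. First, the input limit is applied to the number of distinct inputs of the merged LUT, matching the Figure 2 example in which two 3-input LUTs with one shared input give 5 inputs. Second, the similarity is taken as a Jaccard-style ratio (shared inputs divided by distinct inputs); this is a stand-in and may differ from the formula used in this paper. All function names are hypothetical.

```python
from typing import FrozenSet

Inputs = FrozenSet[str]


def merged_input_count(lut1: Inputs, lut2: Inputs) -> int:
    """Distinct inputs of the dual-output LUT; shared inputs count once."""
    return len(lut1 | lut2)


def can_merge(lut1: Inputs, lut2: Inputs, max_inputs: int = 6) -> bool:
    """Assumed constraint: the merged LUT may use at most max_inputs pins."""
    return merged_input_count(lut1, lut2) <= max_inputs


def similarity(lut1: Inputs, lut2: Inputs) -> float:
    """Assumed Jaccard-style score, NOT necessarily the paper's formula."""
    distinct = len(lut1 | lut2)
    return len(lut1 & lut2) / distinct if distinct else 0.0


# The Figure 2 situation: two 3-input LUTs that share the input "c".
lut1 = frozenset({"a", "b", "c"})
lut2 = frozenset({"c", "d", "e"})
print(merged_input_count(lut1, lut2), can_merge(lut1, lut2), similarity(lut1, lut2))
```

Any score that grows with the number of shared inputs and is normalized by the LUT sizes would play the same role as an edge weight in the matching step; `max_inputs` is left as a parameter because the limit depends on the target architecture.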

Two Weighted Graph Matching Algorithms
Each small LUT is treated as a vertex of a graph. If two LUTs have shared inputs, an edge connects the two corresponding vertices. The similarity between the two LUTs, calculated according to the similarity formula, is the weight of the edge between the two vertices. The problem of merging single-output LUTs into dual-output LUTs can thus be cast as a weighted graph matching problem.
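A minimal Python sketch of this graph construction is given below. It assumes that an edge is created only when two LUTs share at least one input and the merged LUT stays within K distinct inputs, and it takes the weight function as a parameter. The `jaccard` weight used in the example is a placeholder for the similarity S, and all names and LUT contents are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Tuple

Inputs = FrozenSet[str]
Edge = Tuple[str, str]


def build_merge_graph(
    luts: Dict[str, Inputs], k: int, weight: Callable[[Inputs, Inputs], float]
) -> Dict[Edge, float]:
    """Return weighted edges between LUTs that share inputs and fit in k pins."""
    edges: Dict[Edge, float] = {}
    for u, v in combinations(sorted(luts), 2):
        a, b = luts[u], luts[v]
        if a & b and len(a | b) <= k:  # shared input and merge is feasible
            edges[(u, v)] = weight(a, b)
    return edges


def jaccard(a: Inputs, b: Inputs) -> float:
    """Placeholder weight: shared inputs divided by distinct inputs."""
    return len(a & b) / len(a | b)


# Hypothetical single-output LUTs, described by their input sets.
luts = {
    "L1": frozenset("abc"),
    "L2": frozenset("cde"),
    "L3": frozenset("xy"),   # shares no input, so it stays isolated
    "L4": frozenset("abd"),
}
edges = build_merge_graph(luts, k=6, weight=jaccard)
for (u, v), w in sorted(edges.items()):
    print(u, v, w)
```

The pairwise loop is quadratic in the number of LUTs; indexing LUTs by input signal would avoid comparing pairs that cannot share an input, at the cost of a slightly longer sketch.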

Maximum Weight Matching for Bipartite Graph Algorithm
A bipartite graph is a graph whose vertices form two sets, where a vertex in one set can only be connected by an edge to a vertex in the other set. The maximum weight matching of a bipartite graph is the matching with the largest total edge weight. The Hungarian algorithm, also known as the KM algorithm, finds the maximum weight perfect matching of a bipartite graph in O(n^3) time. If the two sets of the bipartite graph do not contain the same number of vertices, the KM algorithm cannot be applied directly. In that case, the smaller set is padded with dummy vertices so that both sets have the same size, and the weight of every edge that does not exist is set to 0. The bipartite graph maximum weight matching problem is thereby transformed into the bipartite graph maximum weight perfect matching problem, which can be solved by the KM algorithm. In this paper, the small LUTs are randomly divided into two equal parts, and the bipartite graph maximum weight matching algorithm is used to match the two LUT sets. Augmenting paths are searched repeatedly for the two LUT sets until the corresponding matching is found, after which the next optimization step can be performed.
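The padding step and the matching itself can be sketched in Python as follows. This is a compact, textbook-style O(n^3) Hungarian (Kuhn-Munkres) routine written for illustration, not the implementation used in this paper. It assumes a rectangular matrix of non-negative weights in which 0 means "no edge"; the function name and the example weights are hypothetical.

```python
from typing import List, Tuple


def max_weight_bipartite_matching(weights: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (left, right) index pairs of a maximum weight matching."""
    rows = len(weights)
    cols = len(weights[0]) if rows else 0
    n = max(rows, cols)
    # Pad to a square matrix with zero-weight dummy vertices and negate the
    # weights, so that a minimum cost perfect matching maximizes the weight.
    cost = [[0.0] * n for _ in range(n)]
    for i in range(rows):
        for j in range(cols):
            cost[i][j] = -weights[i][j]

    inf = float("inf")
    u = [0.0] * (n + 1)   # potentials of left vertices (1-based)
    v = [0.0] * (n + 1)   # potentials of right vertices (1-based)
    p = [0] * (n + 1)     # p[j]: left vertex matched to right vertex j
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:  # grow the alternating tree until a free right vertex
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):  # update the potentials
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    # Keep only real pairs connected by a real (positive weight) edge.
    pairs = []
    for j in range(1, n + 1):
        i = p[j]
        if i - 1 < rows and j - 1 < cols and weights[i - 1][j - 1] > 0:
            pairs.append((i - 1, j - 1))
    return sorted(pairs)


# Hypothetical similarities between three LUTs (rows) and two LUTs (columns).
weights = [[0.5, 0.25], [0.75, 0.0], [0.0, 0.375]]
print(max_weight_bipartite_matching(weights))
```

Pairs that land on a dummy vertex or on a zero-weight entry are dropped at the end, so LUTs without a suitable partner simply stay unmatched for the later optimization step.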

Weighted Graph Matching Algorithm Based on Greedy Strategy
The graph G = (V, E) stores the vertices and the edge weights. The edges are sorted by weight, and for each edge e(u, v) the weights of all edges incident to u and v are compared. If the weight of edge e is the largest among them, pairing u with v gives the best result. After u and v have been paired, all adjacent vertices are notified and no longer try to pair with the paired vertices. All edges are iterated over until no further pairs can be found. Since the bipartite graph maximum weight matching does not consider the connections between vertices inside each of the two sets, many matches are missed. It is therefore necessary to optimize the results generated by that matching; that is, the weighted graph matching algorithm based on the greedy strategy is used to match the unmatched LUTs inside the two LUT sets and to optimize the LUTs already matched.
The main steps in this section are as follows. First, the single-output LUT netlist generated by the mapping is randomly divided into two parts, the two LUT sets are matched with the bipartite graph maximum weight matching algorithm, and the matching results are saved. Then, a weighted graph matching algorithm based on a greedy strategy is used to optimize the matching over the entire set of LUTs: the two LUT sets are combined into one, the virtual edges of the combined LUT set are sorted by similarity, and all virtual edges are traversed and compared with the previous matching to optimize the matching set. Finally, the LUTs are merged according to the optimized matching set.
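The core of the greedy strategy can be sketched in Python as follows. The sketch assumes the simplest variant: edges are visited in order of decreasing weight, and an edge is accepted when neither endpoint has been paired yet. Seeding the procedure with an earlier bipartite matching is not shown, and the function name and example edges are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def greedy_matching(edges: Dict[Edge, float]) -> List[Edge]:
    """Pick edges by decreasing weight, skipping already paired vertices."""
    paired: Set[str] = set()
    matching: List[Edge] = []
    # Ties are broken by vertex names so that the result is deterministic.
    for (u, v), _w in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        if u not in paired and v not in paired:
            matching.append((u, v))
            paired.update((u, v))  # neighbors can no longer pair with u or v
    return matching


# Hypothetical similarity graph on four LUTs.
edges = {("A", "B"): 0.5, ("B", "C"): 0.75, ("C", "D"): 0.5, ("A", "D"): 0.25}
print(greedy_matching(edges))
```

When the heaviest edge in the graph is processed first, it is by construction the heaviest edge at both of its endpoints, which matches the local comparison described above. The greedy result is not guaranteed to be a maximum weight matching.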

Merge Based on Hierarchical Information
The previous merging algorithms are used to merge LUTs with shared inputs, because shared inputs mean that the two LUTs are structurally similar: the more inputs they share, the greater the similarity and the smaller the impact on the circuit after merging. Those algorithms essentially never merge LUTs that share no inputs, since without shared inputs two LUTs may not be suitable for merging. If we want to merge two LUTs without shared inputs to reduce the number of LUTs, stricter constraints are needed. This section considers a heuristic algorithm based on hierarchical information to merge such LUTs into dual-output LUTs. LUTs at the same level and in the same submodule are more likely to be packed into the same logic block in the subsequent packing stage. Since merging two such LUTs has less impact on the circuit, small LUTs in the same submodule and at the same level can be merged according to the level information.
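One way to read this heuristic is sketched below in Python. The sketch groups LUTs by submodule and level and pairs LUTs inside each group whenever the merged LUT fits in K distinct inputs. The `Lut` record, the pairing order (smallest LUTs first), and the feasibility rule are assumptions made for illustration; the exact constraints used in this paper may be stricter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Lut:
    name: str
    inputs: FrozenSet[str]
    level: int    # logic level of the LUT in the mapped netlist
    module: str   # submodule the LUT belongs to


def merge_by_hierarchy(luts: List[Lut], k: int = 6) -> List[Tuple[str, str]]:
    """Pair LUTs from the same submodule and level if they fit in k inputs."""
    groups: Dict[Tuple[str, int], List[Lut]] = defaultdict(list)
    for lut in luts:
        groups[(lut.module, lut.level)].append(lut)

    pairs: List[Tuple[str, str]] = []
    for key in sorted(groups):
        # Try small LUTs first: they are the easiest to fit together.
        pending = sorted(groups[key], key=lambda l: (len(l.inputs), l.name))
        used: Set[str] = set()
        for i, a in enumerate(pending):
            if a.name in used:
                continue
            for b in pending[i + 1:]:
                if b.name not in used and len(a.inputs | b.inputs) <= k:
                    pairs.append((a.name, b.name))
                    used.update((a.name, b.name))
                    break
    return pairs


# Hypothetical LUTs that share no inputs with each other.
luts = [
    Lut("L1", frozenset("ab"), 1, "alu"),
    Lut("L2", frozenset("cde"), 1, "alu"),
    Lut("L3", frozenset("fg"), 2, "alu"),      # different level
    Lut("L4", frozenset("hi"), 1, "ctrl"),     # different submodule
    Lut("L5", frozenset("pqrst"), 1, "alu"),   # too wide to join L1 or L2
]
print(merge_by_hierarchy(luts))
```

Restricting candidates to one (submodule, level) group keeps the search small and avoids pairing LUTs that are unlikely to end up in the same logic block.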

EXPERIMENTAL RESULTS
The proposed LUT merging algorithm is implemented in C++. The experimental environment is an AMD 4800H computer with 8 cores, 16 threads, and 16 GB of memory. ABC [12] is used for FPGA technology mapping to generate single-output LUT mapping results, which are then merged and compared. First, we use "if -C 8" to obtain the ABC "Unmerged" mapping results, and then use the maximum cardinality matching algorithm to merge the generated mapping results to get "ABC merged". Here we only match LUTs that satisfy the constraints. For our flow, we also obtain the ABC "Unmerged" mapping results first and then use the merging algorithm proposed in this paper to merge the LUTs to get our results.
The comparison of our merged results with ABC is shown in Table 1. The experimental results show that our merging algorithm does not affect the delay of the circuit. Compared with the "Unmerged" number, our merging algorithm reduces the number of LUTs by 10.21% on average, and the largest merging rate is 26.76%. Compared with the "ABC merged" number, our merging algorithm increases the merging rate by 3.49% on average.

CONCLUSION
In this work, we propose the principle of similarity, use the bipartite graph maximum weight matching algorithm and a greedy strategy-based weighted graph matching algorithm to merge small LUTs with similarity into dual-output LUTs, and propose a heuristic method that uses hierarchical information to merge small LUTs without similarity. Experimental results show that our merging algorithm merges more LUTs, and thus reduces the number of LUTs further, without affecting delay.
Table 1. Comparison of ABC and our LUT merging