Simulation of the Super-scalar Processor Core Operation

The article considers the features of super-scalar processors, which perform several operations on several pairs of operands simultaneously. The research focuses on how the processor pipeline organizes the execution of several machine instructions in one processor core. A simulating kit was developed to give a better understanding of processor core microarchitecture. It includes two parts: a program and methodical recommendations with multiple task options. The simulating kit demonstrates a pipeline architecture consisting of two clusters, front-end and back-end, and the principle of translating complex multi-cycle CISC-like instructions into simpler RISC-like micro-operations. The main types of machine instructions are considered: data transfer between registers and memory cells (four variations), processing of a pair of operands from registers and memory cells (four variations), and a conditional jump to a specified address. The simulator program makes it possible to conduct a more detailed simulation of one of three mechanisms for accelerating calculations in the processor core: multi-functional (super-scalar) processing, out-of-order processing, and speculative instruction execution after branch prediction. The simulating kit is used in the educational process for training master's students at the National Research University Higher School of Economics.


Introduction
Modern super-scalar processors [1][2][3][4] can perform several operations on several pairs of operands simultaneously. This is achieved by a variety of technological and microarchitectural solutions that parallelize calculations. The main method is pipeline processing of machine instructions [5][6][7][8]. Each processor may contain one or several cores, and each core contains one or more instruction pipelines. The pipeline is, in fact, the main part of the processor, and understanding how it executes instructions gives an understanding of the computing principles of modern computers. A huge number of scientific and practical works are devoted to this issue, but a faster and more visual way to study the principle of pipeline processing is to run and study a simulation model of the processor core. The choice of the degree of modeling detail depends on the problem statement. Developers of electronic devices create cycle-accurate simulators in languages supported by CAD tools (such as SystemVerilog and SystemC) to verify a specific device. These models are not freely available. In our case the task is educational. Free simulators of superscalar processing (which can be found, for example, in [9]) are either simplified models or simulators of individual technological solutions. Neither professional simulators nor partial solutions are suitable for use in the educational process. Therefore, it was decided to develop our own simulator. The peculiarity and advantage of the developed simulator is the following: students get an idea of the microarchitecture features of modern CPU cores by performing a number of simple operations. The structure and type of these The processor is a very complex device, so the model cannot display every aspect of its functioning. We focus on only some of them. We have built and investigated a model that makes it possible to trace how the processor core pipeline executes machine instructions.
The modeling is based on universal features of Intel processor cores with microarchitectures such as Nehalem [2], Sandy Bridge [5], Haswell [1, 10], and Skylake [1, 11]. The analyzed mechanisms are inherent in other modern processors as well, for example the AMD Zen family microarchitecture [12]. The most interesting mechanisms, which are also the hardest to understand, are the following: multi-functional (super-scalar) processing [13], out-of-order (OOO) processing [1, 10, 14], branch prediction [15, 16], speculative instruction execution [17], pipeline reset after misprediction [2], accelerated instruction fetch from the L1i cache [18, 19], the origins of pipeline hazards and bubbles [5, 11, 14], the principles of dividing the pipeline into two clusters, front-end and back-end [3, 4, 20, 21], elimination of repeated data reads from the L1d cache [1, 14], etc. Many scientific works and articles describe the Intel CPU core microarchitecture, so students can easily find additional information on all theoretical issues.

The front-end and back-end clusters of the pipeline
Dividing the pipeline into two parts, front-end and back-end, makes it possible to combine the advantages of two processor architectures: the convenience of the CISC programming model and the high execution speed of RISC [21].

Figure 1. The simulated pipeline
The convenience is provided by complex instructions that can process data from both registers and memory; this is implemented by the front-end CPU cluster. High speed is provided by simple and short operations: CISC instructions are translated into simple RISC-like micro-operations (MOPs) [18], which are processed in the back-end CPU cluster by a set of execution units (Figure 1): ALU (processing), Load (reading data), and Save (writing data). The designation of instruction phases and MOPs is given under the scheme: F-F-F-F-F for the five front-end phases and R-R-R-P-W for the back-end phases.
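The phase designation above can be illustrated with a minimal Python sketch. It is not the simulator's code: the mnemonics `mov` and `proc`, the operand names (`reg*`, `mem*`), and the rule of three read phases per memory operand are assumptions made only for illustration.

```python
# Illustrative sketch only: mnemonics, operand naming and phase counts
# are assumptions, not the actual rules of the simulating kit.

FRONT_END_PHASES = ["F"] * 5  # five front-end phases, as in F-F-F-F-F

# Back-end phase letter -> execution unit named in the text.
UNIT_FOR_PHASE = {"R": "Load", "P": "ALU", "W": "Save"}


def is_memory(operand):
    """Assumed convention: memory cells are named mem0, mem1, ..."""
    return operand.startswith("mem")


def back_end_phases(mnemonic, dest, src):
    """Translate one CISC-like instruction into a list of back-end phases."""
    # Assumption: "proc" reads both operands, "mov" reads only the source.
    sources = [dest, src] if mnemonic == "proc" else [src]
    phases = []
    for operand in sources:
        if is_memory(operand):
            phases += ["R", "R", "R"]  # three read phases per memory operand
    if mnemonic == "proc":
        phases.append("P")  # processing in the ALU
    phases.append("W")  # writing the result
    return phases


def designation(mnemonic, dest, src):
    """Full phase string: front-end phases followed by back-end phases."""
    return "-".join(FRONT_END_PHASES + back_end_phases(mnemonic, dest, src))


if __name__ == "__main__":
    print(designation("proc", "reg1", "mem2"))  # F-F-F-F-F-R-R-R-P-W
    for phase in back_end_phases("proc", "reg1", "mem2"):
        print(phase, "->", UNIT_FOR_PHASE[phase])
```

Under these assumptions, an instruction with one memory source operand reproduces the R-R-R-P-W back-end pattern, while a register-only instruction needs only the P and W phases.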

The simulating kit description
The simulating kit includes the program and the methodical recommendations [22], a manual with multiple task options. The kit demonstrates the pipeline architecture, which consists of two clusters, front-end and back-end, and the principle of translating complex multi-cycle CISC-like instructions into simpler RISC-like micro-operations (MOPs). In addition, it is possible to conduct a more detailed simulation of one of three accelerating mechanisms in the processor core: L.R. №1, multi-functional processing; L.R. №2, out-of-order processing; L.R. №3, speculative instruction execution after branch prediction.
The simulating program includes four windows, which show the input parameters and the simulation results. The simulator takes a code fragment written in a simplified assembler as its initial data. The simulation program automatically generates a new version of the code fragment at startup or at the user's request. The "Task" window is shown in Figure 2. To study the principles of pipeline execution, the following parameters of the pipeline and of the program code must additionally be specified: the number of store/load devices, the number of ALUs (execution units), and the percentage of operations with memory cells (Figure 2).
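How a task might be generated from such parameters can be sketched as follows. This is only a guess at the idea, not the kit's generator: the function name, the mnemonics `mov` and `proc`, the register and cell counts, and the interpretation of the memory percentage (the share of instructions with at least one memory operand) are all assumptions, and conditional jumps are omitted.

```python
import random


def generate_fragment(n_instructions, mem_percent, seed=None,
                      n_regs=8, n_cells=8):
    """Generate a random code fragment in a made-up simplified assembler.

    mem_percent is read here as the percentage of instructions that have
    at least one memory operand (an assumption, not the kit's definition).
    """
    rng = random.Random(seed)  # a seed makes the task variant reproducible

    def operand(use_memory):
        if use_memory:
            return f"mem{rng.randrange(n_cells)}"
        return f"reg{rng.randrange(n_regs)}"

    fragment = []
    for _ in range(n_instructions):
        mnemonic = rng.choice(["mov", "proc"])  # transfer or processing
        if rng.random() * 100 < mem_percent:
            # reg,mem / mem,reg / mem,mem variations
            dest_mem, src_mem = rng.choice(
                [(False, True), (True, False), (True, True)])
        else:
            dest_mem, src_mem = False, False  # reg,reg variation
        fragment.append(
            f"{mnemonic} {operand(dest_mem)},{operand(src_mem)}")
    return fragment


if __name__ == "__main__":
    for line in generate_fragment(10, mem_percent=40, seed=1):
        print(line)
```

Fixing the seed is one simple way to hand out repeatable task variants; the numbers of Load/Save devices and ALUs would be parameters of the pipeline model rather than of the generated code.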

As a result of the simulation, the program provides the following options: exploring the principle of translating instructions into micro-operations ("MOPs" window, Figure 3), studying how the back-end pipeline cluster is occupied cycle by cycle while the given instructions are executed ("Pipeline" window, Figure 4), and examining the time diagram of the instruction execution sequence for the given code fragment ("Diagram" window, Figure 5).
A 16-byte block of program code (which includes several instructions with a total length of up to 16 bytes) is read from the L1i cache into the Prefetch Buffer during each CPU clock cycle. Each window displays the executable code fragment. The "MOPs" window shows the relative instruction number (first column), the instruction mnemonic (second column), the instruction length (third column), the prefetching register names (fourth column), and, in columns 5-9, the MOPs executed in the devices of the CPU back-end cluster (Address Generation, GA; Searching data in Cache L1d, SC; Fetching data from cache into Buffer, FB; data Processing, P; Saving data in Register, SR, or Saving data in Memory, SM). Sequentially fetched blocks of program code are depicted by alternating light and dark lines. Here one can explore the elimination of repeated data reads from the L1d cache.
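The grouping of instructions into 16-byte fetch blocks can be sketched in Python as follows. The function name and the sample instruction lengths are hypothetical, and the sketch assumes that an instruction never straddles two blocks, which the text does not state explicitly.

```python
def split_into_fetch_blocks(lengths, block_size=16):
    """Group instruction indices into fetch blocks of at most block_size bytes.

    lengths[i] is the length in bytes of instruction i. Block k is the one
    read from the L1i cache on fetch cycle k. Assumption: an instruction
    that does not fit into the current block starts a new one.
    """
    blocks, current, used = [], [], 0
    for index, length in enumerate(lengths):
        if length > block_size:
            raise ValueError(f"instruction {index} is longer than a block")
        if used + length > block_size:
            blocks.append(current)  # close the block that is full
            current, used = [], 0
        current.append(index)
        used += length
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    # Hypothetical instruction lengths in bytes.
    lengths = [4, 6, 5, 3, 7, 2, 8]
    for cycle, block in enumerate(split_into_fetch_blocks(lengths)):
        print(f"fetch cycle {cycle}: instructions {block}")
```

With these sample lengths the first three instructions (15 bytes) form one block, the next three (12 bytes) form the second, and the last instruction starts a third, which corresponds to the alternating light and dark lines in the "MOPs" window.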
For example, instruction 6 should be decoded into eight MOPs (Table 1). However, both of its operands (mem3, mem4) have already been read by earlier instructions: instruction 0 (mem3) and instruction 1 (mem4). Six MOPs are therefore cancelled, so the CPU performs fewer operations and the program code executes faster.
Table 1. Theoretically possible and actually executed MOPs for instruction 6 (proc mem4,mem3).
In the "Pipeline" window one can study the parallel cycle-by-cycle execution of MOPs. Out-of-order MOP processing can be tracked by the instruction numbers of the MOPs executed on each clock cycle. For example, five MOPs are executed simultaneously on cycle 4: loading data from the L1d cache into the buffer for instruction 4 (4FB), address generation for instruction 6 (6GA), data processing for instruction 0 (0P), data processing for instruction 9 (9P), and saving data in a register for instruction 1 (1SR). MOP 9P is processed out of order.
The "Diagram" window allows one to examine the sequence and duration of instruction execution for the given code fragment. Figure 5 (big picture) illustrates the diagram of the execution sequence of twenty instructions: the relative instruction number (first column), the instruction mnemonic (second column), and the CPU clock cycles from 0 to 16 (columns 3-19), which show the execution phase of each instruction's MOPs (F: fetch, R: read from memory, P: processing, W: writing/saving the result). The small picture in the same figure shows the instruction execution sequence in the case of a pipeline reset after misprediction, for both the front-end (F-F-F-F-F) and back-end (R-R-R-P-W) clusters. The pipeline reset is represented by crossing out the cancelled phases/MOPs (red letters).
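The cancellation of repeated reads described for instruction 6 can be sketched as follows. The three-instruction program, the `mov` mnemonic, and the rule that a cell stays available in the buffer after its first read are assumptions for illustration; only `proc mem4,mem3` and the MOP names GA, SC, FB, P, SR, SM come from the text.

```python
def decode_with_read_elimination(program):
    """Return, per instruction, (theoretically possible MOPs, executed MOPs).

    program is a list of (mnemonic, dest, src) tuples. Assumptions: "proc"
    reads both operands, "mov" reads only the source, and a memory cell
    that was read once stays available in the buffer afterwards.
    """
    already_read = set()
    result = []
    for mnemonic, dest, src in program:
        sources = [dest, src] if mnemonic == "proc" else [src]
        possible, executed = [], []
        for operand in sources:
            if operand.startswith("mem"):
                load = [f"GA({operand})", f"SC({operand})", f"FB({operand})"]
                possible += load
                if operand not in already_read:
                    executed += load  # first read: all three MOPs run
                    already_read.add(operand)
                # otherwise the three load MOPs are cancelled
        if mnemonic == "proc":
            possible.append("P")
            executed.append("P")
        save = "SM" if dest.startswith("mem") else "SR"
        possible.append(save)
        executed.append(save)
        result.append((possible, executed))
    return result


if __name__ == "__main__":
    # Hypothetical fragment: two earlier reads, then proc mem4,mem3.
    program = [("mov", "reg1", "mem3"),
               ("mov", "reg2", "mem4"),
               ("proc", "mem4", "mem3")]
    possible, executed = decode_with_read_elimination(program)[-1]
    print(len(possible), "possible:", possible)
    print(len(executed), "executed:", executed)
```

In this sketch the last instruction has eight theoretically possible MOPs, of which only P and SM are executed, so six are cancelled, in line with the example above.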

Figure 5. Simulator interface, two variants of the "Diagram" window: multi-functional processing (big picture) and pipeline reset after misprediction (small picture)
The restoration of program order when instruction results are saved after OOO execution is shown in two lines at the bottom of the diagram. It is performed in two steps: fin (waiting in the queue) and release. The diagram makes it easy to see how instructions share data and devices, and to trace instruction stalls.
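The two-step in-order recovery (fin, then release) can be sketched as follows. The timing rules are assumptions: an instruction may be released no earlier than the cycle after it finishes, never before the previous instruction in program order, and any number of instructions may be released in one cycle. The finish cycles in the example are hypothetical.

```python
def in_order_release(finish_cycles):
    """Compute release cycles that restore program order after OOO execution.

    finish_cycles[i] is the cycle in which instruction i (in program order)
    completes its last MOP. Returns one dict per instruction with the
    number of cycles spent waiting in the queue (fin) and the release cycle.
    """
    schedule = []
    previous_release = 0
    for finish in finish_cycles:
        # Release after finishing, but not before the previous instruction.
        release = max(finish + 1, previous_release)
        schedule.append({
            "finish": finish,
            "fin_wait": release - finish - 1,  # cycles spent in the queue
            "release": release,
        })
        previous_release = release
    return schedule


if __name__ == "__main__":
    # Hypothetical finish cycles: instructions 1 and 3 finish early.
    for number, entry in enumerate(in_order_release([5, 3, 7, 4])):
        print(number, entry)
```

Here instructions 1 and 3 finish before their predecessors and therefore wait in the queue for 2 and 3 cycles respectively, while the release cycles (6, 6, 8, 8) never decrease.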

Conclusion
The proposed simulation of the super-scalar processor core makes it possible to study CPU core execution dynamically, cycle by cycle. The simulating kit is useful for studying computer architecture. Understanding the features of the processor is useful for both system developers and programmers: a better programming style lets the computer execute programs faster. The simulating kit is used in teaching the discipline "Computing systems" at the Higher School of Economics.

Acknowledgments
The authors thank the HSE student Lazutov A.N. for help in creating the program part of the kit.