Changeset 3663 for Papers

Ignore:
Timestamp:
Mar 16, 2017, 3:59:16 PM (2 years ago)
Message:

more work on compiler architecture section

File:
1 edited

Legend:

Unmodified
 r3660 Where appropriate, we make reference to the source documentation of these two compilers, to explain our intermediate languages. Note: though many of the compiler intermediate languages (especially in the front end) are shared with CompCert, the architecture of the CerCo compiler is different in quite a substantial way from that of CompCert. In particular, the CompCert compiler's backend linearises' its code much earlier, compared to our own. Indeed, as explained below, the CerCo compiler works with a control-flow graph based representation of the program almost until producing machine code, only linearising in the translation from LTL to LIN intermediate languages. This is merely a design variation in our two compilers, and is not motivated by any real technical issue pertinent to the proof of correctness, or due to the lifting of cost models. \subsection{Compiler front-end} The compiler front-end lowers a program written in C to a low-level language of control-flow graphs, ready for the entry-point of our compiler backend. In total, there are three intermediate languages in the front-end. The first translation step of the compiler lowers the full C language to a high-level intermediate language, Clight---this language will be explained further below. This translation phase is unverified, using the CIL C parsing utility~\cite{cil02} to parse a C file and produce a Clight AST. We therefore do not expand on this here, and only note its existence. \paragraph{Clight} Clight is a large subset of the C language, developed by the CompCert team for use in the CompCert verified C compiler, that we adopt as the source language of our compiler. This translation phase is unverified, using the CIL C parsing utility~\cite{cil02} to handle the intricacies of parsing C. We produce a Clight AST directly from a C source file, using this tool. As this translation step is unverified we do not expand any further on it here, and only note its existence. \paragraph{Clight} is a large subset of the C language, developed by the CompCert team for use in the CompCert verified C compiler, that we adopt as the source language of our compiler. We refer to~\cite{compcert} for a formal definition of the Clight language. In order to limit the implementation effort of verifying the CerCo compiler, our current compiler for Clight does {\em not} cover the operators relating to the floating point type {\tt float}. \paragraph{Cminor} Cminor is a simple, low-level imperative language, comparable to a stripped-down, typeless variant of C. \paragraph{Cminor} is a simple, low-level imperative language, comparable to a stripped-down, typeless variant of C. The Cminor intermediate language again is taken directly from the CompCert verified C compiler, and we refer to the CompCert project for its formal definition. Like Clight, the Cminor language is structured into expressions, statements, functions, and whole programs. This information is produced by a static analysis that determines the variables whose address may be taken'. When translating from Clight to Cminor, space is reserved for local arrays and structures. In a second step, the proper compilation of Clight programs into Cminor programs performed: it consists mainly in translating Clight control structures to the more primitive control-flow constructs available in Cminor. \paragraph{RTLabs} RTLAbs is the last architecture independent language in the compilation process. It is a rather straightforward {\em abstraction} of the {\em architecture-dependent} RTL intermediate language available in the CompCert project and it is intended to factorise some work common to the various target assembly languages (e.g. optimisations) and thus to make retargeting of the compiler a simpler matter. In a second step, the proper compilation of Clight programs into Cminor programs is performed: consisting mainly of the translation of Clight control structures to the more primitive control-flow constructs available in Cminor. \paragraph{RTLabs} is the last architecture independent language in the compilation process. It is a rather straightforward abstraction of the architecture-dependent RTL intermediate language used in the CompCert project. RTLabs is intended to ease retargeting of the compiler---should that be desired---by factorising some work common to the various target assembly languages, for example optimisations. In RTLAbs the structure of Cminor expressions is lost. This may have a negative impact on the instruction selection steps that follow in the compiler backend. \subsection{Compiler back-end} \paragraph{RTL} \paragraph{ERTL} \paragraph{LTL} \paragraph{LIN} \subsection{Target hardware} \subsubsection{RTL} As in RTLAbs, the structure of RTL programs is based on CFGs. RTL is the first architecture-dependant intermediate language of our compiler which, in its current version, targets the MCS-51 assembly language. \paragraph{Syntax.} RTL is very close to RTLAbs. It is based on CFGs and explicits the MCS-51 instructions corresponding to the RTLAbs instructions. Type information disappears: everything is represented using 32 bits integers. Moreover, each global of the program is associated to an offset. The syntax of the language can be found in table \ref{RTL:syntax}. The grammar rules $\textit{unop}$, $\textit{binop}$, $\textit{uncon}$, and $\textit{bincon}$, respectively, represent the sets of unary operations, binary operations, unary conditions and binary conditions of the MCS-51 language. \paragraph{RTL} is the first architecture-dependant intermediate language of our compiler, and the entry point of the compiler back end. RTL programs are structured as CFGs, like RTLabs. The syntax of RTL is very close to RTLAbs, but now MCS-51 instructions are made explicit. Type information completely disappears: everything is represented using 32-bit words. Moreover, each global variable of a program is associated to a fixed offset, as we begin to fix the eventual memory layout of the program in the processor's memory. The syntax of the RTL language is provided in Table~\ref{RTL:syntax}. Within the grammar $\textit{unop}$, $\textit{binop}$, $\textit{uncon}$, and $\textit{bincon}$, represent the sets of unary operations, binary operations, unary conditions and binary conditions of the MCS-51 machine code language, lifted into RTL, respectively. \begin{table} \end{table} \paragraph{Translation of RTLAbs to RTL.} This translation is mostly straightforward. A RTLAbs instruction is often directly translated to a corresponding MCS-51 instruction. There are a few exceptions: some RTLAbs instructions are expanded in two or more MCS-51 instructions. When the translation of a RTLAbs instruction requires more than a few simple MCS-51 instruction, it is translated into a call to a function defined in the preamble of the compilation result. \subsubsection{ERTL} As in RTL, the structure of ERTL programs is based on CFGs. ERTL explicits the calling conventions of the MCS-51 assembly language. \paragraph{Syntax.} The syntax of the language is given in table \ref{ERTL:syntax}. The main difference between RTL and ERTL is the use of hardware registers. Parameters are passed in specific hardware registers; if there are too many parameters, the remaining are stored in the stack. Other conventionally specific hardware registers are used: a register that holds the result of a function, a register that holds the base address of the globals, a register that holds the address of the top of the stack, and some registers that need to be saved when entering a function and whose values are restored when leaving a function. Following these conventions, function calls do not list their parameters anymore; they only mention their number. Two new instructions appear to allocate and deallocate on the stack some space needed by a function to execute. Along with these two instructions come two instructions to fetch or assign a value in the parameter sections of the stack; these instructions cannot yet be translated using regular load and store instructions because we do not know the final size of the stack area of each function. At last, the return instruction has a boolean argument that tells whether the result of the function may later be used or not (this is exploited for optimizations). Translating from RTLabs into RTL is mostly straightforward. An RTLAbs instruction is often directly translated to a corresponding MCS-51 instruction. However, there are a few exceptions: some RTLAbs instructions are expanded into multiple MCS-51 instructions. When the translation of an RTLAbs instruction requires more than a few simple MCS-51 instructions, we generate a function implementing the RTLabs instruction, place it in the program's preamble before the entry point, and generate a call to the relevant function, instead of directly generating the instruction sequence. \paragraph{ERTL} programs are again structured as CFGs. In ERTL, calling conventions of the MCS-51 machine language are made explicit. We provide the syntax of the language in Table~\ref{ERTL:syntax}. The main difference between RTL and ERTL is the elimination of pseudo registers, and the use of hardware registers to pass arguments to functions. Parameters are now passed in fixed hardware registers, tied directly to our target hardware. If there are too many parameters, the remaining arguments to be passed are spilled, and are stored in the stack. Further, we now enforce a convention that certain other now-explicit hardware registers are used for certain tasks. We fix registers holding the result of a function, holding the base address of the program's global variables, holding the address of the top of the stack. We also make explicit that some registers must be saved and restored when entering or returning from a function, to avoid clobbering. ERTL function calls do not list their parameters anymore---they merely assert the number of arguments that are being passed. Two new instructions are provided to allocate and deallocate on the stack some space needed by a function to execute. Along with these two instructions come two instructions to fetch or assign a value in the parameter sections of the stack; these instructions cannot yet be translated using regular load and store instructions. This is because we do not know the final size of the stack area of each function at this point in the compilation chain. Note that the ERTL function return instruction has a boolean argument that indicates whether the result of the function may be used later or not. This is exploited in certain optimisations. \begin{table} \end{table} \paragraph{Translation of RTL to ERTL.} The work consists in expliciting the conventions previously mentioned. These conventions appear when entering, calling and leaving a function, and when referencing a global variable or the address of a local variable. \paragraph{Optimizations.} A \emph{liveness analysis} is performed on ERTL to replace unused instructions by a $\textsf{skip}$. An instruction is tagged as unused when it performs an assignment on a register that will not be read afterwards. Also, the result of the liveness analysis is exploited by a \emph{register allocation} algorithm whose result is to efficiently associate a physical location (a hardware register or an address in the stack) to each pseudo register of the program. The majority of the work involved in translating from RTL to ERTL involves making the MCS-51 calling conventions explicit. These conventions appear when entering, calling and leaving a function, and when referencing a global variable or the address of a local variable. At this point, we also perform a \emph{liveness analysis} on ERTL to replace unused instructions by a $\textsf{skip}$, or dummy instruction with no effect on the program state. Our analysis declares an instruction as unused when it performs an assignment on a register that will not be read afterwards. The result of this liveness analysis is also exploited by the \emph{register allocation} algorithm whose result is to efficiently associate a physical location (a hardware register or an address in the stack) to each pseudo register of the input RTL program. \paragraph{LTL} \paragraph{LIN} \subsection{Target hardware} \subsubsection{LTL}