# source:Papers/jar-cerco-2017/architecture.tex@3664

Last change on this file since 3664 was 3664, checked in by mulligan, 21 months ago

finished description of intermediate languages, will now move on to describing assembly and mcs-51 hardware

File size: 24.4 KB
Line
1\newcommand{\Alt}{ \mid\!\!\mid  }
2
3% CerCo compiler architecture
4%   Description of languages
5%   Target hardware description
6
7\section{Compiler architecture}
8\label{sect.compiler.architecture}
9
10In this section we give an overview of the architecture of the CerCo verified C compiler.
11We discuss the compiler's intermediate languages, the optimisations and other transforms that occur at each intermediate language, and discuss the compiler's target hardware: the MCS-51 8-bit microprocessor.
12
13Many of the intermediate languages used in the CerCo verified C compiler are inspired by analogous languages in the CompCert verified C compiler, and a compiler written by Fran\s{c}ois Pottier---used for teaching compiler construction to undergraduates.
14Where appropriate, we make reference to the source documentation of these two compilers, to explain our intermediate languages.
15
16Note: though many of the compiler intermediate languages (especially in the front end) are shared with CompCert, the architecture of the CerCo compiler is different in quite a substantial way from that of CompCert.
17In particular, the CompCert compiler's backend linearises' its code much earlier, compared to our own.
18Indeed, as explained below, the CerCo compiler works with a control-flow graph based representation of the program almost until producing machine code, only linearising in the translation from LTL to LIN intermediate languages.
19This is merely a design variation in our two compilers, and is not motivated by any real technical issue pertinent to the proof of correctness, or due to the lifting of cost models.
20
21\subsection{Compiler front-end}
22
23The compiler front-end lowers a program written in C to a low-level language of control-flow graphs, ready for the entry-point of our compiler backend.
24In total, there are three intermediate languages in the front-end.
25
26The first translation step of the compiler lowers the full C language to a high-level intermediate language, Clight---this language will be explained further below.
27This translation phase is unverified, using the CIL C parsing utility~\cite{cil02} to handle the intricacies of parsing C.
28We produce a Clight AST directly from a C source file, using this tool.
29As this translation step is unverified we do not expand any further on it here, and only note its existence.
30
31\paragraph{Clight} is a large subset of the C language, developed by the CompCert team for use in the CompCert verified C compiler, that we adopt as the source language of our compiler.
32We refer to~\cite{compcert} for a formal definition of the Clight language.
33
34Briefly, Clight features most of the types and operators of C, and includes pointer arithmetic, pointers to functions, and \texttt{struct} and \texttt{union} types, as well as all of C's control structures.
35The language is classically structured with expressions, statements, functions, and whole programs.
36Clight expressions are side-effect free, which means that side-effect operators ($\texttt{=}$,$\texttt{+=}$,$\texttt{++}$, and so on) and function calls within expressions are not supported.
37This is the key difference between the full C programming language and Clight.
38
39In order to limit the implementation effort of verifying the CerCo compiler, our current compiler for Clight does {\em not} cover the operators relating to the floating point type {\tt float}.
40
41\paragraph{Cminor} is a simple, low-level imperative language, comparable to a stripped-down, typeless variant of C.
42The Cminor intermediate language again is taken directly from the CompCert verified C compiler, and we refer to the CompCert project for its formal definition.
43Like Clight, the Cminor language is structured into expressions, statements, functions, and whole programs.
44Control flow constructs are now presented in a simpler form, compared to Clight.
45In particular, Cminor provides only a single, generic loop' construct, into which the different looping constructs of Clight must be translated.
46Switch statements are also present, but in a semi-explicit jump-table form.
47
48In Cminor, stack operations are made explicit, and one must know which variables are stored on the stack when translating from Clight to Cminor.
49This information is produced by a static analysis that determines the variables whose address may be taken'.
50When translating from Clight to Cminor, space is reserved for local arrays and structures.
51In a second step, the proper compilation of Clight programs into Cminor programs is performed: consisting mainly of the translation of Clight control structures to the more primitive control-flow constructs available in Cminor.
52
53\paragraph{RTLabs} is the last architecture independent language in the compilation process.
54It is a rather straightforward abstraction of the architecture-dependent RTL intermediate language used in the CompCert project.
55
56RTLabs is intended to ease retargeting of the compiler---should that be desired---by factorising some work common to the various target assembly languages, for example optimisations.
57In RTLAbs the structure of Cminor expressions is lost.
58This may have a negative impact on the instruction selection steps that follow in the compiler backend.
59However, the subtleties of instruction selection seem rather orthogonal to our goals, and we deem the possibility of retargeting easily the compiler more important than the efficiency of the generated code.
60
61In RTLAbs, programs are represented as \emph{control flow graphs}, or CFGs for short.
62We associate with the nodes of the graphs instructions reflecting commands of the Cminor language.
63As usual, commands that change the control flow of the program (e.g. loops, conditionals) are translated by inserting suitable branching instructions in the CFG.
64
65Local variables are now represented by \emph{pseudo registers}.
66The number of available pseudo registers is unbounded.
67Eventually these pseudo registers will be replaced by real (machine) registers or stack slots, in the compiler backend.
68
69Translating Cminor programs to RTLAbs consists of transforming Cminor commands into a CFG form.
70Most commands are sequential and have a rather straightforward linear translation.
71However, a conditional is translated into a branch instruction, whilst a loop is translated using a back-edge in the CFG.
72
73\subsection{Compiler back-end}
74
75\paragraph{RTL} is the first architecture-dependant intermediate language of our compiler, and the entry point of the compiler back end.
76RTL programs are structured as CFGs, like RTLabs.
77
78The syntax of RTL is very close to RTLAbs, but now MCS-51 instructions are made explicit.
79Type information completely disappears: everything is represented using 32-bit words.
80Moreover, each global variable of a program is associated to a fixed offset, as we begin to fix the eventual memory layout of the program in the processor's memory.
81The syntax of the RTL language is provided in Table~\ref{RTL:syntax}.
82Within the grammar $\textit{unop}$, $\textit{binop}$, $\textit{uncon}$, and $\textit{bincon}$, represent the sets of unary operations, binary operations, unary conditions and binary conditions of the MCS-51 machine code language, lifted into RTL, respectively.
83
84\begin{table}[t]
85{\footnotesize
86$87\begin{array}{lllllll} 88\textit{size} & ::= & \textsf{Byte} \Alt \textsf{HalfWord} \Alt \textsf{Word} & \qquad 89\textit{fun\_ref} & ::= & \textit{fun\_name} \Alt \textit{psd\_reg} 90\end{array} 91$
92
93$94\begin{array}{llll} 95\textit{instruction} & ::= & \Alt \textsf{skip} \rightarrow \textit{node} & 96 \quad \mbox{(no instruction)}\\ 97 & & \Alt \textit{psd\_reg} := n 98 \rightarrow \textit{node} & \quad \mbox{(constant)}\\ 99 & & \Alt \textit{psd\_reg} := \textit{unop}(\textit{psd\_reg}) 100 \rightarrow \textit{node} & \quad \mbox{(unary operation)}\\ 101 & & \Alt \textit{psd\_reg} := 102 \textit{binop}(\textit{psd\_reg},\textit{psd\_reg}) 103 \rightarrow \textit{node} & \quad 104 \mbox{(binary operation)}\\ 105 & & \Alt \textit{psd\_reg} := \textsf{\&globals}[n] 106 \rightarrow \textit{node} & 107 \quad \mbox{(address of a global)}\\ 108 & & \Alt \textit{psd\_reg} := \textsf{\&locals}[n] 109 \rightarrow \textit{node} & 110 \quad \mbox{(address of a local)}\\ 111 & & \Alt \textit{psd\_reg} := \textit{fun\_name} 112 \rightarrow \textit{node} & 113 \quad \mbox{(address of a function)}\\ 114 & & \Alt \textit{psd\_reg} := 115 \textit{size}(\textit{psd\_reg}[n]) 116 \rightarrow \textit{node} & \quad \mbox{(memory load)}\\ 117 & & \Alt \textit{size}(\textit{psd\_reg}[n]) := 118 \textit{psd\_reg} 119 \rightarrow \textit{node} & \quad \mbox{(memory store)}\\ 120 & & \Alt \textit{psd\_reg} := \textit{fun\_ref}({\it psd\_reg^*}) 121 \rightarrow \textit{node} & 122 \quad \mbox{(function call)}\\ 123 & & \Alt \textit{fun\_ref}({\it psd\_reg^*}) 124 & \quad \mbox{(function tail call)}\\ 125 & & \Alt \textsf{test}\ \textit{uncon}(\textit{psd\_reg}) \rightarrow 126 \textit{node}, \textit{node} & \quad 127 \mbox{(branch unary condition)}\\ 128 & & \Alt \textsf{test}\ \textit{bincon}(\textit{psd\_reg},\textit{psd\_reg}) 129 \rightarrow \textit{node}, \textit{node} & \quad 130 \mbox{(branch binary condition)}\\ 131 & & \Alt \textsf{return}\ \textit{psd\_reg}? & \quad 132 \mbox{(return)} 133\end{array} 134$
135
136$137\begin{array}{lllllll} 138\textit{fun\_def} & ::= & \textit{fun\_name}(\textit{psd\_reg}^*) & \qquad 139\textit{program} & ::= & \textsf{globals}: n\\ 140 & & \textsf{result:} \textit{psd\_reg}? & \qquad 141 & & \textit{fun\_def}^*\\ 142 & & \textsf{locals:} \textit{psd\_reg}^*\\ 143 & & \textsf{stack:} n\\ 144 & & \textsf{entry:} \textit{node}\\ 145 & & \textsf{exit:} \textit{node}\\ 146 & & (\textit{node:} \textit{instruction})^* 147\end{array} 148$}
149\caption{Syntax of the RTL language}\label{RTL:syntax}
150\end{table}
151
152Translating from RTLabs into RTL is mostly straightforward.
153An RTLAbs instruction is often directly translated to a corresponding MCS-51 instruction.
154However, there are a few exceptions: some RTLAbs instructions are expanded into multiple MCS-51 instructions.
155When the translation of an RTLAbs instruction requires more than a few simple MCS-51 instructions, we generate a function implementing the RTLabs instruction, place it in the program's preamble before the entry point, and generate a call to the relevant function, instead of directly generating the instruction sequence.
156
157\paragraph{ERTL} programs are again structured as CFGs.
158In ERTL, calling conventions of the MCS-51 machine language are made explicit.
159We provide the syntax of the language in Table~\ref{ERTL:syntax}.
160
161The main difference between RTL and ERTL is the elimination of pseudo registers, and the use of hardware registers to pass arguments to functions.
162Parameters are now passed in fixed hardware registers, tied directly to our target hardware.
163If there are too many parameters, the remaining arguments to be passed are spilled, and are stored in the stack.
164Pseudo registers are not entirely eliminated at this point: they are still used for all other storage barring passing arguments to functions.
165
166Further, we now enforce a convention that certain other now-explicit hardware registers are used for certain tasks.
167We fix registers holding the result of a function, holding the base address of the program's global variables, holding the address of the top of the stack.
168We also make explicit that some registers must be saved and restored when entering or returning from a function, to avoid clobbering.
169
170ERTL function calls do not list their parameters anymore---they merely assert the number of arguments that are being passed.
171Two new instructions are provided to allocate and deallocate on the stack some space needed by a function to execute.
172Along with these two instructions come two instructions to fetch or assign a value in the parameter sections of the stack; these instructions cannot yet be translated using regular load and store instructions.
173This is because we do not know the final size of the stack area of each function at this point in the compilation chain.
174
175Note that the ERTL function return instruction has a boolean argument that indicates whether the result of the function may be used later or not.
176This is exploited in certain optimisations.
177
178\begin{table}[t]
179{\footnotesize
180$181\begin{array}{lllllll} 182\textit{size} & ::= & \textsf{Byte} \Alt \textsf{HalfWord} \Alt \textsf{Word} & \qquad 183\textit{fun\_ref} & ::= & \textit{fun\_name} \Alt \textit{psd\_reg} 184\end{array} 185$
186
187$188\begin{array}{llll} 189\textit{instruction} & ::= & \Alt \textsf{skip} \rightarrow \textit{node} & 190 \quad \mbox{(no instruction)}\\ 191 & & \Alt \textsf{NewFrame} \rightarrow \textit{node} & 192 \quad \mbox{(frame creation)}\\ 193 & & \Alt \textsf{DelFrame} \rightarrow \textit{node} & 194 \quad \mbox{(frame deletion)}\\ 195 & & \Alt \textit{psd\_reg} := \textsf{stack}[\textit{slot}, n] 196 \rightarrow \textit{node} & 197 \quad \mbox{(stack load)}\\ 198 & & \Alt \textsf{stack}[\textit{slot}, n] := \textit{psd\_reg} 199 \rightarrow \textit{node} & 200 \quad \mbox{(stack store)}\\ 201 & & \Alt \textit{hdw\_reg} := \textit{psd\_reg} 202 \rightarrow \textit{node} & 203 \quad \mbox{(pseudo to hardware)}\\ 204 & & \Alt \textit{psd\_reg} := \textit{hdw\_reg} 205 \rightarrow \textit{node} & 206 \quad \mbox{(hardware to pseudo)}\\ 207 & & \Alt \textit{psd\_reg} := n 208 \rightarrow \textit{node} & \quad \mbox{(constant)}\\ 209 & & \Alt \textit{psd\_reg} := \textit{unop}(\textit{psd\_reg}) 210 \rightarrow \textit{node} & \quad \mbox{(unary operation)}\\ 211 & & \Alt \textit{psd\_reg} := 212 \textit{binop}(\textit{psd\_reg},\textit{psd\_reg}) 213 \rightarrow \textit{node} & \quad 214 \mbox{(binary operation)}\\ 215 & & \Alt \textit{psd\_reg} := \textit{fun\_name} 216 \rightarrow \textit{node} & 217 \quad \mbox{(address of a function)}\\ 218 & & \Alt \textit{psd\_reg} := 219 \textit{size}(\textit{psd\_reg}[n]) 220 \rightarrow \textit{node} & \quad \mbox{(memory load)}\\ 221 & & \Alt \textit{size}(\textit{psd\_reg}[n]) := 222 \textit{psd\_reg} 223 \rightarrow \textit{node} & \quad \mbox{(memory store)}\\ 224 & & \Alt \textit{fun\_ref}(n) \rightarrow \textit{node} & 225 \quad \mbox{(function call)}\\ 226 & & \Alt \textit{fun\_ref}(n) 227 & \quad \mbox{(function tail call)}\\ 228 & & \Alt \textsf{test}\ \textit{uncon}(\textit{psd\_reg}) \rightarrow 229 \textit{node}, \textit{node} & \quad 230 \mbox{(branch unary condition)}\\ 231 & & \Alt \textsf{test}\ \textit{bincon}(\textit{psd\_reg},\textit{psd\_reg}) 232 \rightarrow \textit{node}, \textit{node} & \quad 233 \mbox{(branch binary condition)}\\ 234 & & \Alt \textsf{return}\ b & \quad \mbox{(return)} 235\end{array} 236$
237
238$239\begin{array}{lllllll} 240\textit{fun\_def} & ::= & \textit{fun\_name}(n) & \qquad 241\textit{program} & ::= & \textsf{globals}: n\\ 242 & & \textsf{locals:} \textit{psd\_reg}^* & \qquad 243 & & \textit{fun\_def}^*\\ 244 & & \textsf{stack:} n\\ 245 & & \textsf{entry:} \textit{node}\\ 246 & & (\textit{node:} \textit{instruction})^* 247\end{array} 248$}
249\caption{Syntax of the ERTL language}\label{ERTL:syntax}
250\end{table}
251
252The majority of the work involved in translating from RTL to ERTL involves making the MCS-51 calling conventions explicit.
253These conventions appear when entering, calling and leaving a function, and when referencing a global variable or the address of a local variable.
254
255At this point, we also perform a \emph{liveness analysis} on ERTL to replace unused instructions by a $\textsf{skip}$, or dummy instruction with no effect on the program state.
256Our analysis declares an instruction as unused when it performs an assignment on a register that will not be read afterwards.
257The result of this liveness analysis is also exploited by the \emph{register allocation} algorithm whose result is to efficiently associate a physical location (a hardware register or an address in the stack) to each pseudo register of the input RTL program.
258
259\paragraph{LTL} is another CFG-based intermediate language.
260Pseudo registers are now completely eliminated in favour of physical locations---both hardware registers and stack addresses.
261
262Except for a few exceptions, the instructions of the language are those of ERTL with hardware registers replacing pseudo registers.
263Calling and returning conventions were explicited in ERTL; thus, function calls and returns do not need parameters in LTL.
264The syntax of LTL is presented in Table~\ref{LTL:syntax}.
265
266\begin{table}[t]
267{\footnotesize
268$269\begin{array}{lllllll} 270\textit{size} & ::= & \textsf{Byte} \Alt \textsf{HalfWord} \Alt \textsf{Word} & \qquad 271\textit{fun\_ref} & ::= & \textit{fun\_name} \Alt \textit{hdw\_reg} 272\end{array} 273$
274
275$276\begin{array}{llll} 277\textit{instruction} & ::= & \Alt \textsf{skip} \rightarrow \textit{node} & 278 \quad \mbox{(no instruction)}\\ 279 & & \Alt \textsf{NewFrame} \rightarrow \textit{node} & 280 \quad \mbox{(frame creation)}\\ 281 & & \Alt \textsf{DelFrame} \rightarrow \textit{node} & 282 \quad \mbox{(frame deletion)}\\ 283 & & \Alt \textit{hdw\_reg} := n 284 \rightarrow \textit{node} & \quad \mbox{(constant)}\\ 285 & & \Alt \textit{hdw\_reg} := \textit{unop}(\textit{hdw\_reg}) 286 \rightarrow \textit{node} & \quad \mbox{(unary operation)}\\ 287 & & \Alt \textit{hdw\_reg} := 288 \textit{binop}(\textit{hdw\_reg},\textit{hdw\_reg}) 289 \rightarrow \textit{node} & \quad 290 \mbox{(binary operation)}\\ 291 & & \Alt \textit{hdw\_reg} := \textit{fun\_name} 292 \rightarrow \textit{node} & 293 \quad \mbox{(address of a function)}\\ 294 & & \Alt \textit{hdw\_reg} := \textit{size}(\textit{hdw\_reg}[n]) 295 \rightarrow \textit{node} & \quad \mbox{(memory load)}\\ 296 & & \Alt \textit{size}(\textit{hdw\_reg}[n]) := \textit{hdw\_reg} 297 \rightarrow \textit{node} & \quad \mbox{(memory store)}\\ 298 & & \Alt \textit{fun\_ref}() \rightarrow \textit{node} & 299 \quad \mbox{(function call)}\\ 300 & & \Alt \textit{fun\_ref}() 301 & \quad \mbox{(function tail call)}\\ 302 & & \Alt \textsf{test}\ \textit{uncon}(\textit{hdw\_reg}) \rightarrow 303 \textit{node}, \textit{node} & \quad 304 \mbox{(branch unary condition)}\\ 305 & & \Alt \textsf{test}\ \textit{bincon}(\textit{hdw\_reg},\textit{hdw\_reg}) 306 \rightarrow \textit{node}, \textit{node} & \quad 307 \mbox{(branch binary condition)}\\ 308 & & \Alt \textsf{return} & \quad \mbox{(return)} 309\end{array} 310$
311
312$313\begin{array}{lllllll} 314\textit{fun\_def} & ::= & \textit{fun\_name}(n) & \qquad 315\textit{program} & ::= & \textsf{globals}: n\\ 316 & & \textsf{locals:} n & \qquad 317 & & \textit{fun\_def}^*\\ 318 & & \textsf{stack:} n\\ 319 & & \textsf{entry:} \textit{node}\\ 320 & & (\textit{node:} \textit{instruction})^* 321\end{array} 322$}
323\caption{Syntax of the LTL language}\label{LTL:syntax}
324\end{table}
325
326Translating ERTL to LTL relies on the results of the liveness analysis performed on ERTL, and mentioned above, and of the register allocation process.
327Unused instructions are eliminated and each pseudo register is replaced by a physical location.
328In LTL, the size of the stack frame of a function is now known.
329As a result, instructions intended to load or store values in the stack can be translated using regular load and store instructions of the hardware target.
330
331We use a \emph{graph compression} algorithm to remove empty instructions generated by previous compilation passes and by the liveness analysis.
332
333\paragraph{LIN} is a linearised intermediate language, and the final intermediate language before assembly language is produced.
334The body of every function within LIN is now represented as a sequence of instructions.
335Despite one being a CFG language, and the other being linear, the instructions of LIN are very close to those of LTL.
336\emph{Program labels}, \emph{gotos} and branch instructions handle the changes in the control flow.
337The syntax of LIN programs is presented in Table~\ref{LIN:syntax}.
338
339\begin{table}[t]
340{\footnotesize
341$342\begin{array}{lllllll} 343\textit{size} & ::= & \textsf{Byte} \Alt \textsf{HalfWord} \Alt \textsf{Word} & \qquad 344\textit{fun\_ref} & ::= & \textit{fun\_name} \Alt \textit{hdw\_reg} 345\end{array} 346$
347
348$349\begin{array}{llll} 350\textit{instruction} & ::= & \Alt \textsf{NewFrame} & 351 \quad \mbox{(frame creation)}\\ 352 & & \Alt \textsf{DelFrame} & 353 \quad \mbox{(frame deletion)}\\ 354 & & \Alt \textit{hdw\_reg} := n & \quad \mbox{(constant)}\\ 355 & & \Alt \textit{hdw\_reg} := \textit{unop}(\textit{hdw\_reg}) 356 & \quad \mbox{(unary operation)}\\ 357 & & \Alt \textit{hdw\_reg} := 358 \textit{binop}(\textit{hdw\_reg},\textit{hdw\_reg}) 359 & \quad \mbox{(binary operation)}\\ 360 & & \Alt \textit{hdw\_reg} := \textit{fun\_name} 361 & \quad \mbox{(address of a function)}\\ 362 & & \Alt \textit{hdw\_reg} := \textit{size}(\textit{hdw\_reg}[n]) 363 & \quad \mbox{(memory load)}\\ 364 & & \Alt \textit{size}(\textit{hdw\_reg}[n]) := \textit{hdw\_reg} 365 & \quad \mbox{(memory store)}\\ 366 & & \Alt \textsf{call}\ \textit{fun\_ref} 367 & \quad \mbox{(function call)}\\ 368 & & \Alt \textsf{tailcall}\ \textit{fun\_ref} 369 & \quad \mbox{(function tail call)}\\ 370 & & \Alt \textit{uncon}(\textit{hdw\_reg}) \rightarrow 371 \textit{node} & \quad 372 \mbox{(branch unary condition)}\\ 373 & & \Alt \textit{bincon}(\textit{hdw\_reg},\textit{hdw\_reg}) 374 \rightarrow \textit{node} & \quad 375 \mbox{(branch binary condition)}\\ 376 & & \Alt \textit{mips\_label:} & \quad \mbox{(MCS-51 label)}\\ 377 & & \Alt \textsf{goto}\ \textit{mips\_label} & \quad \mbox{(goto)}\\ 378 & & \Alt \textsf{return} & \quad \mbox{(return)} 379\end{array} 380$
381
382$383\begin{array}{lllllll} 384\textit{fun\_def} & ::= & \textit{fun\_name}(n) & \qquad 385\textit{program} & ::= & \textsf{globals}: n\\ 386 & & \textsf{locals:} n & \qquad 387 & & \textit{fun\_def}^*\\ 388 & & \textit{instruction}^* 389\end{array} 390$}
391\caption{Syntax of the LIN language}\label{LIN:syntax}
392\end{table}
393
394Translating LTL to LIN amounts to walking the CFG representation of each LTL function, starting from the main' function, and producing a series of lists of instructions from each function body.
395Some dead-code elimination is implemented here, as functions that are never called are eliminated entirely.
396
397\subsection{Target hardware}
398
399\paragraph{MCS-51 assembly language}
Note: See TracBrowser for help on using the repository browser.