1 | \documentclass[11pt,epsf,a4wide]{article} |
---|
2 | \usepackage[mathletters]{ucs} |
---|
3 | \usepackage[utf8x]{inputenc} |
---|
4 | \usepackage{listings} |
---|
5 | \usepackage{../../style/cerco} |
---|
6 | \newcommand{\ocaml}{OCaml} |
---|
7 | \newcommand{\clight}{Clight} |
---|
8 | \newcommand{\matita}{Matita} |
---|
9 | \newcommand{\sdcc}{\texttt{sdcc}} |
---|
10 | |
---|
11 | \newcommand{\textSigma}{\ensuremath{\Sigma}} |
---|
12 | |
---|
13 | % LaTeX Companion, p 74 |
---|
14 | \newcommand{\todo}[1]{\marginpar{\raggedright - #1}} |
---|
15 | |
---|
16 | \lstdefinelanguage{coq} |
---|
17 | {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record}, |
---|
18 | morekeywords={[2]if,then,else}, |
---|
19 | } |
---|
20 | |
---|
21 | \lstdefinelanguage{matita} |
---|
22 | {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on}, |
---|
23 | morekeywords={[2]whd,normalize,elim,cases,destruct}, |
---|
24 | mathescape=true, |
---|
25 | morecomment=[n]{(*}{*)}, |
---|
26 | } |
---|
27 | |
---|
28 | \lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false, |
---|
29 | keywordstyle=\color{red}\bfseries, |
---|
30 | keywordstyle=[2]\color{blue}, |
---|
31 | commentstyle=\color{green}, |
---|
32 | stringstyle=\color{blue}, |
---|
33 | showspaces=false,showstringspaces=false} |
---|
34 | |
---|
35 | \lstset{extendedchars=false} |
---|
36 | \lstset{inputencoding=utf8x} |
---|
37 | \DeclareUnicodeCharacter{8797}{:=} |
---|
38 | \DeclareUnicodeCharacter{10746}{++} |
---|
39 | \DeclareUnicodeCharacter{9001}{\ensuremath{\langle}} |
---|
40 | \DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}} |
---|
41 | |
---|
42 | |
---|
43 | \title{ |
---|
44 | INFORMATION AND COMMUNICATION TECHNOLOGIES\\ |
---|
45 | (ICT)\\ |
---|
46 | PROGRAMME\\ |
---|
47 | \vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}} |
---|
48 | |
---|
49 | \date{ } |
---|
50 | \author{} |
---|
51 | |
---|
52 | \begin{document} |
---|
53 | \thispagestyle{empty} |
---|
54 | |
---|
55 | \vspace*{-1cm} |
---|
56 | \begin{center} |
---|
57 | \includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png} |
---|
58 | \end{center} |
---|
59 | |
---|
60 | \begin{minipage}{\textwidth} |
---|
61 | \maketitle |
---|
62 | \end{minipage} |
---|
63 | |
---|
64 | |
---|
65 | \vspace*{0.5cm} |
---|
66 | \begin{center} |
---|
67 | \begin{LARGE} |
---|
68 | \bf |
---|
69 | Report n. D3.2\\ |
---|
70 | CIC encoding: Front-end\\ |
---|
71 | \end{LARGE} |
---|
72 | \end{center} |
---|
73 | |
---|
74 | \vspace*{2cm} |
---|
75 | \begin{center} |
---|
76 | \begin{large} |
---|
77 | Version 1.0 |
---|
78 | \end{large} |
---|
79 | \end{center} |
---|
80 | |
---|
81 | \vspace*{0.5cm} |
---|
82 | \begin{center} |
---|
83 | \begin{large} |
---|
84 | Main Authors:\\ |
---|
85 | Brian~Campbell |
---|
86 | \end{large} |
---|
87 | \end{center} |
---|
88 | |
---|
89 | \vspace*{\fill} |
---|
90 | \noindent |
---|
91 | Project Acronym: \cerco{}\\ |
---|
92 | Project full title: Certified Complexity\\ |
---|
93 | Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\ |
---|
94 | |
---|
95 | \clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881} |
---|
96 | |
---|
97 | \newpage |
---|
98 | |
---|
99 | \vspace*{7cm} |
---|
100 | \paragraph{Abstract} |
---|
101 | We describe the translation of the front end of the \cerco{} compiler from the |
---|
102 | \ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the |
---|
103 | \matita{} proof assistant. This transforms programs in the C-like |
---|
104 | \textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably |
---|
105 | target-independent and in the form of a control flow graph. |
---|
106 | |
---|
107 | We also report on progress enriching these transformations with dependent |
---|
108 | types so as to establish key invariants in each intermediate language, which |
---|
109 | removes potential sources of failure within the compiler and |
---|
110 | will assist in future work proving correctness properties. |
---|
111 | |
---|
112 | \newpage |
---|
113 | |
---|
114 | \tableofcontents |
---|
115 | |
---|
116 | % TODO: clear up any -ize vs -ise |
---|
117 | % TODO: clear up any "front end" vs "front-end" |
---|
118 | % TODO: clear up any mentions of languages that aren't textsf'd. |
---|
119 | % TODO: fix unicode in listings |
---|
120 | % TODO: make it clear that our correctness is intended to go beyond compcert's |
---|
121 | % TODO: capitalise deliverable/task? |
---|
122 | |
---|
123 | \section{Introduction} |
---|
124 | |
---|
125 | The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but |
---|
126 | the certified compiler will be a program written in the Calculus of Inductive |
---|
127 | Constructions (CIC), as realised by the \matita{} proof assistant. This |
---|
128 | deliverable reports on the translation of the front-end of the compiler into |
---|
129 | CIC and the subsequent efforts to start exploiting dependent types to maintain |
---|
130 | invariants and rule out potential sources of failure in the compiler. |
---|
131 | |
---|
132 | The input language for the formalized compiler is the \textsf{Clight} |
---|
133 | language. This is a C-like language with side-effect free expressions that |
---|
134 | was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will |
---|
135 | also use their CIL-based C parser to generate \textsf{Clight} abstract syntax |
---|
136 | trees, but will not formalize this code.} and provided with an executable |
---|
137 | semantics. See~\cite{d3.1} for more details on the syntax and semantics. |
---|
138 | |
---|
139 | \begin{figure} |
---|
140 | \begin{center} |
---|
141 | \begin{minipage}{.8\linewidth} |
---|
142 | \begin{tabbing} |
---|
143 | \quad \= $\downarrow$ \quad \= \kill |
---|
144 | \textsf{C} (unformalized)\\ |
---|
145 | \> $\downarrow$ \> CIL parser (unformalized \ocaml)\\ |
---|
146 | \textsf{Clight}\\ |
---|
147 | \> $\downarrow$ \> cast removal\\ |
---|
148 | \> $\downarrow$ \> add runtime functions\\ |
---|
149 | \> $\downarrow$ \> labelling\\ |
---|
150 | \> $\downarrow$ \> stack variable allocation and control structure |
---|
151 | simplification\\ |
---|
152 | \textsf{Cminor}\\ |
---|
153 | \> $\downarrow$ \> generate global variable initialisation code\\ |
---|
154 | \> $\downarrow$ \> transform to RTL graph\\ |
---|
155 | \textsf{RTLabs}\\ |
---|
156 | \> $\downarrow$ \> start of target specific backend\\ |
---|
157 | \>\quad \vdots |
---|
158 | \end{tabbing} |
---|
159 | \end{minipage} |
---|
160 | \end{center} |
---|
161 | \caption{Front-end languages and transformations} |
---|
162 | \label{fig:summary} |
---|
163 | \end{figure} |
---|
164 | |
---|
165 | The front-end of the compiler is summarised in Figure~\ref{fig:summary}. The |
---|
166 | two intermediate languages involved are |
---|
167 | \begin{description} |
---|
168 | \item[\textsf{Cminor}] --- a C-like language where local variables are not |
---|
169 | explicitly allocated memory and control structures are simpler |
---|
170 | |
---|
171 | \item[\textsf{RTLabs}] --- a language in the form of a control flow graph |
---|
172 | which retains the values and front end operations from \textsf{Cminor} |
---|
173 | \end{description} |
---|
174 | More details on the formalisation of the syntax and semantics of these |
---|
175 | languages can be found in the accompanying deliverable 3.4. |
---|
176 | Development of the formalized front-end was conducted in concert with the |
---|
177 | development of these intermediate languages to facilitate testing. |
---|
178 | |
---|
179 | \subsection{Revisions to the prototype compiler} |
---|
180 | |
---|
181 | We have been tracking revisions to the prototype compiler during the |
---|
182 | development of the formalized version. Most of these changes were minor, but |
---|
183 | one exception is a major change to the structure of the compiler. |
---|
184 | |
---|
185 | The original plan for the front-end featured a \textsf{Clight} to |
---|
186 | \textsf{Clight8} phase near the start which replaced all of the integer |
---|
187 | values and operations by 8 bit counterparts, while pointers were split into |
---|
188 | bytes at a later stage. Experience has shown that it would be difficult to |
---|
189 | produce good code with this approach. Instead, we now have: |
---|
190 | \begin{itemize} |
---|
191 | \item full size integers, pointers and operations until code selection (the |
---|
192 | first part of the back-end after \textsf{RTLabs}), and |
---|
193 | \item a cast removal stage which simplifies \textsf{Clight} expressions such |
---|
194 | as |
---|
195 | \begin{lstlisting}[language=C,belowskip=0pt] |
---|
196 | (char)((int)x + (int)y) |
---|
197 | \end{lstlisting} |
---|
198 | into equivalent operations on simpler types, \lstinline'x+y' in this case. |
---|
199 | The cast removal is important because C requires \emph{arithmetic promotion} |
---|
200 | of integer types to (at least) \lstinline'int' before an operation is |
---|
201 | performed. The \textsf{Clight} semantics do not perform the promotions, |
---|
202 | instead they are added as casts by the CIL-based parser. However, our targets |
---|
203 | benefit immensely from performing operations on the smallest possible integer |
---|
204 | type, so it is important that we remove promotions where possible. |
---|
205 | \end{itemize} |
---|
206 | |
---|
207 | This document describes the formalized front end after these changes. |
---|
208 | |
---|
209 | \section{\textsf{Clight} phases} |
---|
210 | |
---|
211 | In addition to the conversion to \textsf{Cminor}, there are several |
---|
212 | transformations which act directly on the \textsf{Clight} language. |
---|
213 | |
---|
214 | \subsection{Cast simplification} |
---|
215 | |
---|
216 | We noted above that the arithmetic promotion required by C (and implemented in |
---|
217 | the CIL-based parser) adds numerous casts, causing arithmetic operations to be |
---|
218 | performed on 32 bit integers. If left alone, the resulting code will be |
---|
219 | larger and slower. This phase removes many of the casts so that the |
---|
220 | operations can be performed more efficiently. |
---|
221 | |
---|
222 | The prototype version worked by recognising fixed patterns in the |
---|
223 | \textsf{Clight} abstract syntax tree such as |
---|
224 | \[ (t)((t_1)e_1\ op\ (t_2)e_2), \] |
---|
225 | subject to restrictions on the types. These are replaced with a simpler |
---|
226 | version without the casts. Such `deep' pattern matching is slightly awkward in |
---|
227 | \matita{} and this approach does not capture compositions of operations, such as |
---|
228 | \begin{lstlisting}[language=C] |
---|
229 | (char)(((int)a + (int)b) + (int)c) |
---|
230 | \end{lstlisting} |
---|
231 | where \lstinline'a', \lstinline'b' and \lstinline'c' are of type |
---|
232 | \lstinline'char', because the intermediate expression is not cast to and from |
---|
233 | \lstinline'char'. |
---|
234 | |
---|
235 | The formalized version uses a different method, recursively examining each |
---|
236 | expression constructor to see if the expression can be coerced to some |
---|
237 | `desired' type. For example, when processing the above expression it reaches |
---|
238 | each \lstinline'int' cast with a desired type of \lstinline'char', notes that |
---|
239 | the subexpression is of type \lstinline'char' and eliminates the cast. |
---|
240 | Moreover, when the recursive processing is complete the \lstinline'char' cast |
---|
241 | is also eliminated because its subexpression is already of the correct type. |
---|
242 | |
---|
243 | This has been implemented in \matita. We have also performed a few proofs that |
---|
244 | the arithmetic behind these changes is correct to gain confidence in the |
---|
245 | technique. During task 3.4 we will extend these proofs to cover more |
---|
246 | operations and show that the semantics of the expressions are equivalent, not |
---|
247 | just the underlying arithmetic. |
---|
248 | |
---|
249 | \subsection{Labelling} |
---|
250 | |
---|
251 | This phase adds cost labels to the \textsf{Clight} program. It is a fairly |
---|
252 | simple recursive definition, and was straightforward to port to \matita. The |
---|
253 | generation of cost labels was handled by our generic identifiers code, |
---|
254 | described in the accompanying deliverable 3.3 on intermediate languages. |
---|
255 | |
---|
256 | \subsection{Runtime functions} |
---|
257 | |
---|
258 | Some operations on integers do not have a simple translation to the target |
---|
259 | machine code. In particular, we need to replace operations for 16 and 32-bit |
---|
260 | division and most bitwise shifts with calls to runtime functions. These |
---|
261 | functions need to be added to the program at an early stage because of their |
---|
262 | impact on execution time: any loops must be available to our labelling |
---|
263 | mechanism so that we can report on how long the resulting machine code will |
---|
264 | take to execute. |
---|
265 | |
---|
266 | We follow the prototype in replacing the affected expressions, which requires |
---|
267 | us to break up expressions into multiple statements because function calls are |
---|
268 | not permitted in \textsf{Clight} expressions. We may investigate moving these |
---|
269 | substitutions to a later stage of the compiler if they prove difficult to |
---|
270 | reason about. However, this would also require adjusting the semantics so |
---|
271 | that the costs still appear in the evaluation of \textsf{Clight} programs. |
---|
272 | |
---|
273 | The prototype adds the functions themselves by generating C code as text and |
---|
274 | reparsing the program. This is unsuitable for formalization, so we generate |
---|
275 | \textsf{Clight} abstract syntax trees directly. |
---|
276 | |
---|
277 | \subsection{Conversion to Cminor} |
---|
278 | |
---|
279 | The conversion to \textsf{Cminor} performs two essential tasks. First, it |
---|
280 | determines which local variables need to be stored in memory and generates |
---|
281 | explicit memory accesses for them. Second, it must translate the control |
---|
282 | structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s |
---|
283 | more basic structures. |
---|
284 | |
---|
285 | These are both performed by code similar to that in the prototype, although |
---|
286 | the use of generic fold operations on statements and expressions has been |
---|
287 | replaced by simpler recursive definitions. |
---|
288 | |
---|
289 | There are two additional pieces of work that the formalized translation must |
---|
290 | do. The \textsf{Cminor} definition features some mild constraints of the |
---|
291 | types of expressions, which we can enforce in the translation using some type |
---|
292 | checking. The error monad is used to dispose of ill-typed \textsf{Clight} |
---|
293 | programs. |
---|
294 | |
---|
295 | The other difficulty is that we need to generate fresh temporary variables to |
---|
296 | store function results in before they are written to memory. This is |
---|
297 | necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions |
---|
298 | as the destination for the returned value, but \textsf{Cminor} only allows |
---|
299 | local variables. All other variable names in the \textsf{Cminor} program came |
---|
300 | from the \textsf{Clight} program, but we need to construct a method for |
---|
301 | generating fresh names for the temporaries. |
---|
302 | |
---|
303 | Our identifiers are based on binary numbers, and generation of fresh names is |
---|
304 | handled by keeping track of the highest allocated number. Normally this is |
---|
305 | initialised at zero, but if initialised by the largest existing identifier in |
---|
306 | the \textsf{Clight} program then the generated names will be fresh. |
---|
307 | To do this, we extract the maximum identifier by recursively finding the maximum |
---|
308 | variable name used in every expression, statement and function of the program. |
---|
309 | |
---|
310 | \section{\textsf{Cminor} phases} |
---|
311 | |
---|
312 | \textsf{Cminor} programs are processed by two passes: one deals with the |
---|
313 | initialisation of global variables, and the other produces \textsf{RTLabs} |
---|
314 | code. |
---|
315 | |
---|
316 | \subsection{Initialisation code} |
---|
317 | |
---|
318 | This replaces the initialisation data with explicit code in the main function. |
---|
319 | The only remarkable point in the formalization is that we have two slightly |
---|
320 | different instantiations of the \textsf{Cminor} syntax: one with |
---|
321 | initialisation data that this pass takes as input, and one with only size |
---|
322 | information that is the output. In addition to reflecting the purpose of this |
---|
323 | pass in its type, it also ensures that the pass cannot be accidentally omitted. |
---|
324 | |
---|
325 | \subsection{Conversion to \textsf{RTLabs}} |
---|
326 | |
---|
327 | This pass breaks down the structure of the \textsf{Cminor} program into a |
---|
328 | control flow graph, but maintains the same set of operations. The algorithm |
---|
329 | is stateful in the sense that it builds up the \textsf{RTLabs} function body |
---|
330 | incrementally, but all of the relevant state is already present in the |
---|
331 | function record (including the fresh register and graph label name generators) |
---|
332 | and the prototype passes this around. Thus the formalized code is very |
---|
333 | similar in nature. |
---|
334 | |
---|
335 | One possible variation would be to explicitly define a state monad to carry |
---|
336 | the function under construction around, but it is not yet clear if this will |
---|
337 | make the correctness results easier to prove. |
---|
338 | |
---|
339 | \section{Adding and using invariants} |
---|
340 | |
---|
341 | The compiler phases described above all use the error monad to deal with |
---|
342 | inconsistencies in the program being transformed. In particular, lookups in |
---|
343 | environments may fail, control flow graphs may have missing statements and |
---|
344 | various structural problems may be present. We would like to show that these |
---|
345 | failures are absent where possible by establishing that programs are well |
---|
346 | formed early in the compilation process. |
---|
347 | |
---|
348 | This work overlaps with deliverable 3.3 (where more details of the additions |
---|
349 | to the syntax and semantics of the intermediate languages can be found) and |
---|
350 | task 3.4 on the correctness of the compiler. Thus this work is experimental |
---|
351 | in nature, and will evolve during task 3.4. |
---|
352 | |
---|
353 | The use of the invariants follows a common pattern. Each language embeds |
---|
354 | invariants in the function record that constrain the function body by other |
---|
355 | information in the record (such as the list of local variables and types, or |
---|
356 | the set of labels declared). However, during the transformations they |
---|
357 | typically need to be refined to constraints on individual statements and |
---|
358 | expressions with respect to data structures used in the transformation. |
---|
359 | A similar change in invariants is required between the transformation and the |
---|
360 | new function. |
---|
361 | |
---|
362 | For example, consider the use of local variables in the \textsf{Cminor} to |
---|
363 | \textsf{RTLabs} stage. We start with |
---|
364 | \begin{lstlisting}[language=matita] |
---|
365 | record internal_function : Type[0] ≝ |
---|
366 | { f_return : option typ |
---|
367 | ; f_params : list (ident × typ) |
---|
368 | ; f_vars : list (ident × typ) |
---|
369 | ; f_stacksize : nat |
---|
370 | ; f_body : stmt |
---|
371 | ; f_inv : stmt_P (λs.stmt_vars (λi.Exists ? (λx.\fst x = i) (f_params @ f_vars)) s ∧ |
---|
372 | stmt_labels (λl.Exists ? (λl'.l' = l) (labels_of f_body)) s) f_body |
---|
373 | }. |
---|
374 | \end{lstlisting} |
---|
375 | where the first half of \lstinline'f_inv' requires every variable in the |
---|
376 | function body to appear in the parameter or variable list. In the translation |
---|
377 | to \textsf{RTLabs}, variable lookups are performed in a map to \textsf{RTLabs} |
---|
378 | pseudo-registers: |
---|
379 | \begin{lstlisting}[language=matita] |
---|
380 | definition env ≝ identifier_map SymbolTag register. |
---|
381 | |
---|
382 | let rec add_expr (le:label_env) (env:env) (ty:typ) (e:expr ty) |
---|
383 | (Env:expr_vars ty e (present ?? env)) |
---|
384 | (dst:register) (f:partial_fn le) on e |
---|
385 | : Σf':partial_fn le. fn_graph_included le f f' ≝ |
---|
386 | match e return λty,e.expr_vars ty e (present ?? env) → |
---|
387 | Σf':partial_fn le. fn_graph_included le f f' with |
---|
388 | [ Id _ i ⇒ λEnv. |
---|
389 | let r ≝ lookup_reg env i Env in |
---|
390 | ... |
---|
391 | \end{lstlisting} |
---|
392 | Note that \lstinline'lookup_reg' returns a register without any possibility of |
---|
393 | error. The reason this works is that the pattern match on \lstinline'e' |
---|
394 | refines the type of the invariant \lstinline'Env' to a proof that the variable |
---|
395 | \lstinline'i' is present. We then pass this proof to the lookup function to |
---|
396 | rule out failure. |
---|
397 | |
---|
398 | When this map \lstinline'env' is constructed at the start of the phase, we |
---|
399 | prove that the proof \lstinline'f_inv' from the function implies the invariant |
---|
400 | on variables needed by \lstinline'add_expr' and its equivalent on statements: |
---|
401 | \begin{lstlisting}[language=matita] |
---|
402 | lemma populates_env : ∀l,e,u,l',e',u'. |
---|
403 | populate_env e u l = 〈l',e',u'〉 → |
---|
404 | ∀i. Exists ? (λx.\fst x = i) l → |
---|
405 | present ?? e' i. |
---|
406 | \end{lstlisting} |
---|
407 | A similar mechanism is used to show that \texttt{goto} labels are always |
---|
408 | declared in the function. |
---|
409 | |
---|
410 | Also note the return type of \lstinline'add_expr' is a dependent pair. We |
---|
411 | build the resulting \textsf{RTLabs} function incrementally, using a type |
---|
412 | \lstinline'partial_fn' that does not contain the final invariant for |
---|
413 | functions. We always require the \lstinline'fn_graph_included' property for |
---|
414 | partially built functions to show that the graph only gets larger, a key part |
---|
415 | of the proof that the resulting control flow graph is closed. Dependent pairs |
---|
416 | are used in a similar manner in the \textsf{Clight} to \textsf{Cminor} phase |
---|
417 | too. |
---|
418 | |
---|
419 | This work does not currently cover all of the possible sources of failure; in |
---|
420 | particular some structural constraints on functions are not yet covered and |
---|
421 | some properties of \textsf{RTLabs} programs that may be useful for later |
---|
422 | stages or the correctness proofs are not produced. Moreover, we may |
---|
423 | experiment with variations to try to make the proof obligations and syntax |
---|
424 | simpler to deal with. However, it does show that retrofitting these |
---|
425 | properties using dependent types in \matita{} is feasible. |
---|
426 | |
---|
427 | \section{Testing} |
---|
428 | |
---|
429 | To provide some early testing and bug fixing of this code we constructed it in |
---|
430 | concert with the executable semantics described in deliverable 3.3, and |
---|
431 | \matita{} term pretty printers in the prototype compiler. Using these, we |
---|
432 | were able to test the phases individually and together by running programs |
---|
433 | within the proof assistant itself, and comparing the results with the expected |
---|
434 | output. |
---|
435 | |
---|
436 | \section{Conclusion} |
---|
437 | |
---|
438 | We have formalized the front-end of the \cerco{} compiler in the \matita{} |
---|
439 | proof assistant, and shown that invariants can be added to the intermediate |
---|
440 | languages to help show properties of it. This work provides a solid basis for |
---|
441 | the compiler correctness proofs in task 3.4. |
---|
442 | |
---|
443 | \bibliographystyle{plain} |
---|
444 | \bibliography{report} |
---|
445 | |
---|
446 | \end{document} |
---|