source: Deliverables/D3.2/Report/report.tex @ 1277

Last change on this file since 1277 was 1277, checked in by campbell, 8 years ago

Runtime functions and conclusion for D3.2.

File size: 18.7 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.2\\
70CIC encoding: Front-end\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We describe the translation of the front end of the \cerco{} compiler from the
102\ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the
103\matita{} proof assistant.  This transforms programs in the C-like
104\textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably
105target-independent and in the form of a control flow graph.
106
107We also report on progress enriching these transformations with dependent
108types so as to establish key invariants in each intermediate language, which
109removes potential sources of failure within the compiler and
110will assist in future work proving correctness properties.
111
112\newpage 
113
114\tableofcontents
115
116% TODO: clear up any -ize vs -ise
117% TODO: clear up any "front end" vs "front-end"
118% TODO: clear up any mentions of languages that aren't textsf'd.
119% TODO: fix unicode in listings
120% TODO: make it clear that our correctness is intended to go beyond compcert's
121% TODO: capitalise deliverable/task?
122
123\section{Introduction}
124
125The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but
126the certified compiler will be a program written in the Calculus of Inductive
127Constructions (CIC), as realised by the \matita{} proof assistant.  This
128deliverable reports on the translation of the front-end of the compiler into
129CIC and the subsequent efforts to start exploiting dependent types to maintain
130invariants and rule out potential sources of failure in the compiler.
131
132The input language for the formalized compiler is the \textsf{Clight}
133language.  This is a C-like language with side-effect free expressions that
134was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will
135also use their CIL-based C parser to generate \textsf{Clight} abstract syntax
136trees, but will not formalize this code.} and provided with an executable
137semantics.  See~\cite{d3.1} for more details on the syntax and semantics.
138
139\begin{figure}
140\begin{center}
141\begin{minipage}{.8\linewidth}
142\begin{tabbing}
143\quad \= $\downarrow$ \quad \= \kill
144\textsf{C} (unformalized)\\
145\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
146\textsf{Clight}\\
147\> $\downarrow$ \> cast removal\\
148\> $\downarrow$ \> add runtime functions\\
149\> $\downarrow$ \> labelling\\
150\> $\downarrow$ \> stack variable allocation and control structure
151 simplification\\
152\textsf{Cminor}\\
153\> $\downarrow$ \> generate global variable initialisation code\\
154\> $\downarrow$ \> transform to RTL graph\\
155\textsf{RTLabs}\\
156\> $\downarrow$ \> start of target specific backend\\
157\>\quad \vdots
158\end{tabbing}
159\end{minipage}
160\end{center}
161\caption{Front-end languages and transformations}
162\label{fig:summary}
163\end{figure}
164
165The front-end of the compiler is summarised in Figure~\ref{fig:summary}.  The
166two intermediate languages involved are
167\begin{description}
168\item[\textsf{Cminor}] --- a C-like language where local variables are not
169explicitly allocated memory and control structures are simpler
170
171\item[\textsf{RTLabs}] --- a language in the form of a control flow graph
172which retains the values and front end operations from \textsf{Cminor}
173\end{description}
174More details on the formalisation of the syntax and semantics of these
175languages can be found in the accompanying deliverable 3.4.
176Development of the formalized front-end was conducted in concert with the
177development of these intermediate languages to facilitate testing.
178
179\subsection{Revisions to the prototype compiler}
180
181We have been tracking revisions to the prototype compiler during the
182development of the formalized version.  Most of these changes were minor, but
183one exception is a major change to the structure of the compiler.
184
185The original plan for the front-end featured a \textsf{Clight} to
186\textsf{Clight8} phase near the start which replaced all of the integer
187values and operations by 8 bit counterparts, while pointers were split into
188bytes at a later stage.  Experience has shown that it would be difficult to
189produce good code with this approach.  Instead, we now have:
190\begin{itemize}
191\item full size integers, pointers and operations until code selection (the
192first part of the back-end after \textsf{RTLabs}), and
193\item a cast removal stage which simplifies \textsf{Clight} expressions such
194as
195\begin{lstlisting}[language=C,belowskip=0pt]
196    (char)((int)x + (int)y)
197\end{lstlisting}
198into equivalent operations on simpler types, \lstinline'x+y' in this case.
199The cast removal is important because C requires \emph{arithmetic promotion}
200of integer types to (at least) \lstinline'int' before an operation is
201performed.  The \textsf{Clight} semantics do not perform the promotions,
202instead they are added as casts by the CIL-based parser.  However, our targets
203benefit immensely from performing operations on the smallest possible integer
204type, so it is important that we remove promotions where possible.
205\end{itemize}
206
207This document describes the formalized front end after these changes.
208
209\section{\textsf{Clight} phases}
210
211In addition to the conversion to \textsf{Cminor}, there are several
212transformations which act directly on the \textsf{Clight} language.
213
214\subsection{Cast simplification}
215
216We noted above that the arithmetic promotion required by C (and implemented in
217the CIL-based parser) adds numerous casts, causing arithmetic operations to be
218performed on 32 bit integers.  If left alone, the resulting code will be
219larger and slower.  This phase removes many of the casts so that the
220operations can be performed more efficiently.
221
222The prototype version worked by recognising fixed patterns in the
223\textsf{Clight} abstract syntax tree such as
224\[ (t)((t_1)e_1\ op\ (t_2)e_2), \]
225subject to restrictions on the types.  These are replaced with a simpler
226version without the casts.  Such `deep' pattern matching is slightly awkward in
227\matita{} and this approach does not capture compositions of operations, such as
228\begin{lstlisting}[language=C]
229    (char)(((int)a + (int)b) + (int)c)
230\end{lstlisting}
231where \lstinline'a', \lstinline'b' and \lstinline'c' are of type
232\lstinline'char', because the intermediate expression is not cast to and from
233\lstinline'char'.
234
235The formalized version uses a different method, recursively examining each
236expression constructor to see if the expression can be coerced to some
237`desired' type.  For example, when processing the above expression it reaches
238each \lstinline'int' cast with a desired type of \lstinline'char', notes that
239the subexpression is of type \lstinline'char' and eliminates the cast.
240Moreover, when the recursive processing is complete the \lstinline'char' cast
241is also eliminated because its subexpression is already of the correct type.
242
243This has been implemented in \matita.  We have also performed a few proofs that
244the arithmetic behind these changes is correct to gain confidence in the
245technique.  During task 3.4 we will extend these proofs to cover more
246operations and show that the semantics of the expressions are equivalent, not
247just the underlying arithmetic.
248
249\subsection{Labelling}
250
251This phase adds cost labels to the \textsf{Clight} program.  It is a fairly
252simple recursive definition, and was straightforward to port to \matita.  The
253generation of cost labels was handled by our generic identifiers code,
254described in the accompanying deliverable 3.3 on intermediate languages.
255
256\subsection{Runtime functions}
257
258Some operations on integers do not have a simple translation to the target
259machine code.  In particular, we need to replace operations for 16 and 32-bit
260division and most bitwise shifts with calls to runtime functions.  These
261functions need to be added to the program at an early stage because of their
262impact on execution time:  any loops must be available to our labelling
263mechanism so that we can report on how long the resulting machine code will
264take to execute.
265
266We follow the prototype in replacing the affected expressions, which requires
267us to break up expressions into multiple statements because function calls are
268not permitted in \textsf{Clight} expressions.  We may investigate moving these
269substitutions to a later stage of the compiler if they prove difficult to
270reason about.  However, this would also require adjusting the semantics so
271that the costs still appear in the evaluation of \textsf{Clight} programs.
272
273The prototype adds the functions themselves by generating C code as text and
274reparsing the program.  This is unsuitable for formalization, so we generate
275\textsf{Clight} abstract syntax trees directly.
276
277\subsection{Conversion to Cminor}
278
279The conversion to \textsf{Cminor} performs two essential tasks.  First, it
280determines which local variables need to be stored in memory and generates
281explicit memory accesses for them.  Second, it must translate the control
282structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s
283more basic structures.
284
285These are both performed by code similar to that in the prototype, although
286the use of generic fold operations on statements and expressions has been
287replaced by simpler recursive definitions.
288
289There are two additional pieces of work that the formalized translation must
290do.  The \textsf{Cminor} definition features some mild constraints of the
291types of expressions, which we can enforce in the translation using some type
292checking.  The error monad is used to dispose of ill-typed \textsf{Clight}
293programs.
294
295The other difficulty is that we need to generate fresh temporary variables to
296store function results in before they are written to memory.  This is
297necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions
298as the destination for the returned value, but \textsf{Cminor} only allows
299local variables.  All other variable names in the \textsf{Cminor} program came
300from the \textsf{Clight} program, but we need to construct a method for
301generating fresh names for the temporaries.
302
303Our identifiers are based on binary numbers, and generation of fresh names is
304handled by keeping track of the highest allocated number.  Normally this is
305initialised at zero, but if initialised by the largest existing identifier in
306the \textsf{Clight} program then the generated names will be fresh.
307To do this, we extract the maximum identifier by recursively finding the maximum
308variable name used in every expression, statement and function of the program.
309
310\section{\textsf{Cminor} phases}
311
312\textsf{Cminor} programs are processed by two passes: one deals with the
313initialisation of global variables, and the other produces \textsf{RTLabs}
314code.
315
316\subsection{Initialisation code}
317
318This replaces the initialisation data with explicit code in the main function.
319The only remarkable point in the formalization is that we have two slightly
320different instantiations of the \textsf{Cminor} syntax: one with
321initialisation data that this pass takes as input, and one with only size
322information that is the output.  In addition to reflecting the purpose of this
323pass in its type, it also ensures that the pass cannot be accidentally omitted.
324
325\subsection{Conversion to \textsf{RTLabs}}
326
327This pass breaks down the structure of the \textsf{Cminor} program into a
328control flow graph, but maintains the same set of operations.  The algorithm
329is stateful in the sense that it builds up the \textsf{RTLabs} function body
330incrementally, but all of the relevant state is already present in the
331function record (including the fresh register and graph label name generators)
332and the prototype passes this around.  Thus the formalized code is very
333similar in nature.
334
335One possible variation would be to explicitly define a state monad to carry
336the function under construction around, but it is not yet clear if this will
337make the correctness results easier to prove.
338
339\section{Adding and using invariants}
340
341The compiler phases described above all use the error monad to deal with
342inconsistencies in the program being transformed.  In particular, lookups in
343environments may fail, control flow graphs may have missing statements and
344various structural problems may be present.  We would like to show that these
345failures are absent where possible by establishing that programs are well
346formed early in the compilation process.
347
348This work overlaps with deliverable 3.3 (where more details of the additions
349to the syntax and semantics of the intermediate languages can be found) and
350task 3.4 on the correctness of the compiler.  Thus this work is experimental
351in nature, and will evolve during task 3.4.
352
353The use of the invariants follows a common pattern.  Each language embeds
354invariants in the function record that constrain the function body by other
355information in the record (such as the list of local variables and types, or
356the set of labels declared).  However, during the transformations they
357typically need to be refined to constraints on individual statements and
358expressions with respect to data structures used in the transformation.
359A similar change in invariants is required between the transformation and the
360new function.
361
362For example, consider the use of local variables in the \textsf{Cminor} to
363\textsf{RTLabs} stage.  We start with
364\begin{lstlisting}[language=matita]
365record internal_function : Type[0] ≝
366{ f_return    : option typ
367; f_params    : list (ident × typ)
368; f_vars      : list (ident × typ)
369; f_stacksize : nat
370; f_body      : stmt
371; f_inv       : stmt_P (λs.stmt_vars (λi.Exists ? (λx.\fst x = i) (f_params @ f_vars)) s ∧
372                           stmt_labels (λl.Exists ? (λl'.l' = l) (labels_of f_body)) s) f_body
373}.
374\end{lstlisting}
375where the first half of \lstinline'f_inv' requires every variable in the
376function body to appear in the parameter or variable list.  In the translation
377to \textsf{RTLabs}, variable lookups are performed in a map to \textsf{RTLabs}
378pseudo-registers:
379\begin{lstlisting}[language=matita]
380definition env ≝ identifier_map SymbolTag register.
381
382let rec add_expr (le:label_env) (env:env) (ty:typ) (e:expr ty)
383                   (Env:expr_vars ty e (present ?? env))
384                   (dst:register) (f:partial_fn le) on e
385     : Σf':partial_fn le. fn_graph_included le f f' ≝
386match e return λty,e.expr_vars ty e (present ?? env) →
387                 Σf':partial_fn le. fn_graph_included le f f' with
388[ Id _ i ⇒ λEnv.
389    let r ≝ lookup_reg env i Env in
390    ...
391\end{lstlisting}
392Note that \lstinline'lookup_reg' returns a register without any possibility of
393error.  The reason this works is that the pattern match on \lstinline'e'
394refines the type of the invariant \lstinline'Env' to a proof that the variable
395\lstinline'i' is present.  We then pass this proof to the lookup function to
396rule out failure.
397
398When this map \lstinline'env' is constructed at the start of the phase, we
399prove that the proof \lstinline'f_inv' from the function implies the invariant
400on variables needed by \lstinline'add_expr' and its equivalent on statements:
401\begin{lstlisting}[language=matita]
402lemma populates_env : ∀l,e,u,l',e',u'.
403  populate_env e u l = 〈l',e',u'〉 →
404  ∀i. Exists ? (λx.\fst x = i) l →
405      present ?? e' i.
406\end{lstlisting}
407A similar mechanism is used to show that \texttt{goto} labels are always
408declared in the function.
409
410Also note the return type of \lstinline'add_expr' is a dependent pair.  We
411build the resulting \textsf{RTLabs} function incrementally, using a type
412\lstinline'partial_fn' that does not contain the final invariant for
413functions.  We always require the \lstinline'fn_graph_included' property for
414partially built functions to show that the graph only gets larger, a key part
415of the proof that the resulting control flow graph is closed.  Dependent pairs
416are used in a similar manner in the \textsf{Clight} to \textsf{Cminor} phase
417too.
418
419This work does not currently cover all of the possible sources of failure; in
420particular some structural constraints on functions are not yet covered and
421some properties of \textsf{RTLabs} programs that may be useful for later
422stages or the correctness proofs are not produced.  Moreover, we may
423experiment with variations to try to make the proof obligations and syntax
424simpler to deal with.  However, it does show that retrofitting these
425properties using dependent types in \matita{} is feasible.
426
427\section{Testing}
428
429To provide some early testing and bug fixing of this code we constructed it in
430concert with the executable semantics described in deliverable 3.3, and
431\matita{} term pretty printers in the prototype compiler.  Using these, we
432were able to test the phases individually and together by running programs
433within the proof assistant itself, and comparing the results with the expected
434output.
435
436\section{Conclusion}
437
438We have formalized the front-end of the \cerco{} compiler in the \matita{}
439proof assistant, and shown that invariants can be added to the intermediate
440languages to help show properties of it.  This work provides a solid basis for
441the compiler correctness proofs in task 3.4.
442
443\bibliographystyle{plain}
444\bibliography{report}
445
446\end{document}
Note: See TracBrowser for help on using the repository browser.