source: Deliverables/D3.2/Report/report.tex @ 1346

Last change on this file since 1346 was 1346, checked in by campbell, 8 years ago

Minor corrections and a paragraph of context in the abstract to deliverables 3.2 and 3.3.

File size: 19.1 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.2\\
70CIC encoding: Front-end\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We describe the translation of the front-end of the \cerco{} compiler from the
102\ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the
103\matita{} proof assistant.  This transforms programs in the C-like
104\textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably
105target-independent and in the form of a control flow graph.
106
107We also report on progress enriching these transformations with dependent
108types so as to establish key invariants in each intermediate language, which
109removes potential sources of failure within the compiler and
110will assist in future work proving correctness properties.
111
112This work was Task~3.2 of the \cerco{} project, translating the prototype
113described in Deliverable~2.2~\cite{d2.2} into CIC using the intermediate languages
114formalized in Deliverables~3.1~\cite{d3.1} and~3.3.  It will feed into the front-end
115correctness proofs in Task~3.4 and is a counterpart to the back-end
116formalization in Task~4.2.
117
118\newpage 
119
120\tableofcontents
121
122% TODO: clear up any -ize vs -ise
123% CHECK: clear up any "front end" vs "front-end"
124% CHECK: clear up any mentions of languages that aren't textsf'd.
125% CHECK: fix unicode in listings
126% CHEKC: capitalise deliverable/task when referring to a particular one
127
128\section{Introduction}
129
130The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but
131the certified compiler will be a program written in the Calculus of Inductive
132Constructions (CIC), as realised by the \matita{} proof assistant.  This
133deliverable reports on the translation of the front-end of the compiler into
134CIC and the subsequent efforts to start exploiting dependent types to maintain
135invariants and rule out potential sources of failure in the compiler.
136
137The input language for the formalized compiler is the \textsf{Clight}
138language.  This is a C-like language with side-effect free expressions that
139was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will
140also use their CIL-based C parser to generate \textsf{Clight} abstract syntax
141trees, but will not formalize this code.} and provided with an executable
142semantics.  See~\cite{d3.1} for more details on the syntax and semantics.
143
144\begin{figure}
145\begin{center}
146\begin{minipage}{.8\linewidth}
147\begin{tabbing}
148\quad \= $\downarrow$ \quad \= \kill
149\textsf{C} (unformalized)\\
150\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
151\textsf{Clight}\\
152\> $\downarrow$ \> cast removal\\
153\> $\downarrow$ \> add runtime functions\\
154\> $\downarrow$ \> labelling\\
155\> $\downarrow$ \> stack variable allocation and control structure
156 simplification\\
157\textsf{Cminor}\\
158\> $\downarrow$ \> generate global variable initialisation code\\
159\> $\downarrow$ \> transform to RTL graph\\
160\textsf{RTLabs}\\
161\> $\downarrow$ \> start of target specific back-end\\
162\>\quad \vdots
163\end{tabbing}
164\end{minipage}
165\end{center}
166\caption{Front-end languages and transformations}
167\label{fig:summary}
168\end{figure}
169
170The front-end of the compiler is summarised in Figure~\ref{fig:summary}.  The
171two intermediate languages involved are
172\begin{description}
173\item[\textsf{Cminor}] --- a C-like language where local variables are not
174explicitly allocated memory and control structures are simpler
175
176\item[\textsf{RTLabs}] --- a language in the form of a control flow graph
177which retains the values and front-end operations from \textsf{Cminor}
178\end{description}
179More details on the formalisation of the syntax and semantics of these
180languages can be found in the accompanying Deliverable 3.4.
181Development of the formalized front-end was conducted in concert with the
182development of these intermediate languages to facilitate testing.
183
184\subsection{Revisions to the prototype compiler}
185
186We have been tracking revisions to the prototype compiler during the
187development of the formalized version.  Most of these changes were minor, but
188one exception is a major change to the structure of the compiler.
189
190The original plan for the front-end featured a \textsf{Clight} to
191\textsf{Clight8} phase near the start which replaced all of the integer
192values and operations by 8 bit counterparts, while pointers were split into
193bytes at a later stage.  Experience has shown that it would be difficult to
194produce good code with this approach.  Instead, we now have:
195\begin{itemize}
196\item full size integers, pointers and operations until code selection (the
197first part of the back-end after \textsf{RTLabs}), and
198\item a cast removal stage which simplifies \textsf{Clight} expressions such
199as
200\begin{lstlisting}[language=C,belowskip=0pt]
201    (char)((int)x + (int)y)
202\end{lstlisting}
203into equivalent operations on simpler types, \lstinline'x+y' in this case.
204The cast removal is important because C requires \emph{arithmetic promotion}
205of integer types to (at least) \lstinline'int' before an operation is
206performed.  The \textsf{Clight} semantics do not perform the promotions,
207instead they are added as casts by the CIL-based parser.  However, our targets
208benefit immensely from performing operations on the smallest possible integer
209type, so it is important that we remove promotions where possible.
210\end{itemize}
211
212This document describes the formalized front-end after these changes.
213
214\section{\textsf{Clight} phases}
215
216In addition to the conversion to \textsf{Cminor}, there are several
217transformations which act directly on the \textsf{Clight} language.
218
219\subsection{Cast simplification}
220
221We noted above that the arithmetic promotion required by C (and implemented in
222the CIL-based parser) adds numerous casts, causing arithmetic operations to be
223performed on 32 bit integers.  If left alone, the resulting code will be
224larger and slower.  This phase removes many of the casts so that the
225operations can be performed more efficiently.
226
227The prototype version worked by recognising fixed patterns in the
228\textsf{Clight} abstract syntax tree such as
229\[ (t)((t_1)e_1\ op\ (t_2)e_2), \]
230subject to restrictions on the types.  These are replaced with a simpler
231version without the casts.  Such `deep' pattern matching is slightly awkward in
232\matita{} and this approach does not capture compositions of operations, such as
233\begin{lstlisting}[language=C]
234    (char)(((int)a + (int)b) + (int)c)
235\end{lstlisting}
236where \lstinline'a', \lstinline'b' and \lstinline'c' are of type
237\lstinline'char', because the intermediate expression is not cast to and from
238\lstinline'char'.
239
240The formalized version uses a different method, recursively examining each
241expression constructor to see if the expression can be coerced to some
242`desired' type.  For example, when processing the above expression it reaches
243each \lstinline'int' cast with a desired type of \lstinline'char', notes that
244the subexpression is of type \lstinline'char' and eliminates the cast.
245Moreover, when the recursive processing is complete the \lstinline'char' cast
246is also eliminated because its subexpression is already of the correct type.
247
248This has been implemented in \matita.  We have also performed a few proofs that
249the arithmetic behind these changes is correct to gain confidence in the
250technique.  During Task 3.4 we will extend these proofs to cover more
251operations and show that the semantics of the expressions are equivalent, not
252just the underlying arithmetic.
253
254\subsection{Labelling}
255
256This phase adds cost labels to the \textsf{Clight} program.  It is a fairly
257simple recursive definition, and was straightforward to port to \matita.  The
258generation of cost labels was handled by our generic identifiers code,
259described in the accompanying Deliverable 3.3 on intermediate languages.
260
261\subsection{Runtime functions}
262
263Some operations on integers do not have a simple translation to the target
264machine code.  In particular, we need to replace operations for 16 and 32-bit
265division and most bitwise shifts with calls to runtime functions.  These
266functions need to be added to the program at an early stage because of their
267impact on execution time:  any loops must be available to our labelling
268mechanism so that we can report on how long the resulting machine code will
269take to execute.
270
271We follow the prototype in replacing the affected expressions, which requires
272us to break up expressions into multiple statements because function calls are
273not permitted in \textsf{Clight} expressions.  We may investigate moving these
274substitutions to a later stage of the compiler if they prove difficult to
275reason about.  However, this would also require adjusting the semantics so
276that the costs still appear in the evaluation of \textsf{Clight} programs.
277
278The prototype adds the functions themselves by generating C code as text and
279reparsing the program.  This is unsuitable for formalization, so we generate
280\textsf{Clight} abstract syntax trees directly.
281
282\subsection{Conversion to \textsf{Cminor}}
283
284The conversion to \textsf{Cminor} performs two essential tasks.  First, it
285determines which local variables need to be stored in memory and generates
286explicit memory accesses for them.  Second, it must translate the control
287structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s
288more basic structures.
289
290These are both performed by code similar to that in the prototype, although
291the use of generic fold operations on statements and expressions has been
292replaced by simpler recursive definitions.
293
294There are two additional pieces of work that the formalized translation must
295do.  The \textsf{Cminor} definition features some mild constraints of the
296types of expressions, which we can enforce in the translation using some type
297checking.  The error monad is used to dispose of ill-typed \textsf{Clight}
298programs.
299
300The other difficulty is that we need to generate fresh temporary variables to
301store function results in before they are written to memory.  This is
302necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions
303as the destination for the returned value, but \textsf{Cminor} only allows
304local variables.  All other variable names in the \textsf{Cminor} program came
305from the \textsf{Clight} program, but we need to construct a method for
306generating fresh names for the temporaries.
307
308Our identifiers are based on binary numbers, and generation of fresh names is
309handled by keeping track of the highest allocated number.  Normally this is
310initialised at zero, but if initialised by the largest existing identifier in
311the \textsf{Clight} program then the generated names will be fresh.
312To do this, we extract the maximum identifier by recursively finding the maximum
313variable name used in every expression, statement and function of the program.
314
315\section{\textsf{Cminor} phases}
316
317\textsf{Cminor} programs are processed by two passes: one deals with the
318initialisation of global variables, and the other produces \textsf{RTLabs}
319code.
320
321\subsection{Initialisation code}
322
323This replaces the initialisation data with explicit code in the main function.
324The only remarkable point in the formalization is that we have two slightly
325different instantiations of the \textsf{Cminor} syntax: one with
326initialisation data that this pass takes as input, and one with only size
327information that is the output.  In addition to reflecting the purpose of this
328pass in its type, it also ensures that the pass cannot be accidentally omitted.
329
330\subsection{Conversion to \textsf{RTLabs}}
331
332This pass breaks down the structure of the \textsf{Cminor} program into a
333control flow graph, but maintains the same set of operations.  The algorithm
334is stateful in the sense that it builds up the \textsf{RTLabs} function body
335incrementally, but all of the relevant state is already present in the
336function record (including the fresh register and graph label name generators)
337and the prototype passes this around.  Thus the formalized code is very
338similar in nature.
339
340One possible variation would be to explicitly define a state monad to carry
341the function under construction around, but it is not yet clear if this will
342make the correctness results easier to prove.
343
344\section{Adding and using invariants}
345
346The compiler phases described above all use the error monad to deal with
347inconsistencies in the program being transformed.  In particular, lookups in
348environments may fail, control flow graphs may have missing statements and
349various structural problems may be present.  We would like to show that these
350failures are absent where possible by establishing that programs are well
351formed early in the compilation process.
352
353This work overlaps with Deliverable 3.3 (where more details of the additions
354to the syntax and semantics of the intermediate languages can be found) and
355Task 3.4 on the correctness of the compiler.  Thus this work is experimental
356in nature, and will evolve during Task 3.4.
357
358The use of the invariants follows a common pattern.  Each language embeds
359invariants in the function record that constrain the function body by other
360information in the record (such as the list of local variables and types, or
361the set of labels declared).  However, during the transformations they
362typically need to be refined to constraints on individual statements and
363expressions with respect to data structures used in the transformation.
364A similar change in invariants is required between the transformation and the
365new function.
366
367For example, consider the use of local variables in the \textsf{Cminor} to
368\textsf{RTLabs} stage.  We start with
369\begin{lstlisting}[language=matita]
370record internal_function : Type[0] ≝
371{ f_return    : option typ
372; f_params    : list (ident $\times$ typ)
373; f_vars      : list (ident $\times$ typ)
374; f_stacksize : nat
375; f_body      : stmt
376; f_inv       : stmt_P ($\lambda$s.stmt_vars ($\lambda$i.Exists ? ($\lambda$x.\fst x = i) (f_params @ f_vars)) s $\wedge$
377                           stmt_labels ($\lambda$l.Exists ? ($\lambda$l'.l' = l) (labels_of f_body)) s) f_body
378}.
379\end{lstlisting}
380where the first half of \lstinline'f_inv' requires every variable in the
381function body to appear in the parameter or variable list.  In the translation
382to \textsf{RTLabs}, variable lookups are performed in a map to \textsf{RTLabs}
383pseudo-registers:
384\begin{lstlisting}[language=matita]
385definition env ≝ identifier_map SymbolTag register.
386
387let rec add_expr (le:label_env) (env:env) (ty:typ) (e:expr ty)
388                   (Env:expr_vars ty e (present ?? env))
389                   (dst:register) (f:partial_fn le) on e
390     : $\Sigma$f':partial_fn le. fn_graph_included le f f' ≝
391match e return $\lambda$ty,e.expr_vars ty e (present ?? env) $\rightarrow$
392                 $\Sigma$f':partial_fn le. fn_graph_included le f f' with
393[ Id _ i $\Rightarrow$ $\lambda$Env.
394    let r ≝ lookup_reg env i Env in
395    ...
396\end{lstlisting}
397Note that \lstinline'lookup_reg' returns a register without any possibility of
398error.  The reason this works is that the pattern match on \lstinline'e'
399refines the type of the invariant \lstinline'Env' to a proof that the variable
400\lstinline'i' is present.  We then pass this proof to the lookup function to
401rule out failure.
402
403When this map \lstinline'env' is constructed at the start of the phase, we
404prove that the proof \lstinline'f_inv' from the function implies the invariant
405on variables needed by \lstinline'add_expr' and its equivalent on statements:
406\begin{lstlisting}[language=matita]
407lemma populates_env : $\forall$l,e,u,l',e',u'.
408  populate_env e u l = $\langle$l',e',u'$\rangle$ $\rightarrow$
409  $\forall$i. Exists ? ($\lambda$x.\fst x = i) l $\rightarrow$
410      present ?? e' i.
411\end{lstlisting}
412A similar mechanism is used to show that \texttt{goto} labels are always
413declared in the function.
414
415Also note the return type of \lstinline'add_expr' is a dependent pair.  We
416build the resulting \textsf{RTLabs} function incrementally, using a type
417\lstinline'partial_fn' that does not contain the final invariant for
418functions.  We always require the \lstinline'fn_graph_included' property for
419partially built functions to show that the graph only gets larger, a key part
420of the proof that the resulting control flow graph is closed.  Dependent pairs
421are used in a similar manner in the \textsf{Clight} to \textsf{Cminor} phase
422too.
423
424This work does not currently cover all of the possible sources of failure; in
425particular some structural constraints on functions are not yet covered and
426some properties of \textsf{RTLabs} programs that may be useful for later
427stages or the correctness proofs are not produced.  Moreover, we may
428experiment with variations to try to make the proof obligations and syntax
429simpler to deal with.  However, it does show that retrofitting these
430properties using dependent types in \matita{} is feasible.
431
432\section{Testing}
433
434To provide some early testing and bug fixing of this code we constructed it in
435concert with the executable semantics described in Deliverable 3.3, and
436\matita{} term pretty printers in the prototype compiler.  Using these, we
437were able to test the phases individually and together by running programs
438within the proof assistant itself, and comparing the results with the expected
439output.
440
441\section{Conclusion}
442
443We have formalized the front-end of the \cerco{} compiler in the \matita{}
444proof assistant, and shown that invariants can be added to the intermediate
445languages to help show properties of it.  This work provides a solid basis for
446the compiler correctness proofs in Task 3.4.
447
448\bibliographystyle{plain}
449\bibliography{report}
450
451\end{document}
Note: See TracBrowser for help on using the repository browser.