source: Deliverables/D3.2/Report/report.tex @ 1261

Last change on this file since 1261 was 1261, checked in by campbell, 8 years ago

More of D3.2.

File size: 17.3 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.2\\
70CIC encoding: Front-end\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We describe the translation of the front end of the \cerco{} compiler from the
102\ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the
103\matita{} proof assistant.  This transforms programs in the C-like
104\textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably
105target-independent and in the form of a control flow graph.
106
107We also report on progress enriching these transformations with dependent
108types so as to establish key invariants in each intermediate language, which
109removes potential sources of failure within the compiler and
110will assist in future work proving correctness properties.
111
112\newpage 
113
114\tableofcontents
115
116% TODO: clear up any -ize vs -ise
117% TODO: clear up any "front end" vs "front-end"
118% TODO: clear up any mentions of languages that aren't textsf'd.
119% TODO: fix unicode in listings
120% TODO: make it clear that our correctness is intended to go beyond compcert's
121% TODO: capitalise deliverable/task?
122
123\section{Introduction}
124
125The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but
126the certified compiler will be a program written in the Calculus of Inductive
127Constructions (CIC), as realised by the \matita{} proof assistant.  This
128deliverable reports on the translation of the front-end of the compiler into
129CIC and the subsequent efforts to start exploiting dependent types to maintain
130invariants and rule out potential sources of failure in the compiler.
131
132The input language for the formalized compiler is the \textsf{Clight}
133language.  This is a C-like language with side-effect free expressions that
134was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will
135also use their CIL-based C parser to generate \textsf{Clight} abstract syntax
136trees, but will not formalize this code.} and provided with an executable
137semantics.  See~\cite{d3.1} for more details on the syntax and semantics.
138
139\begin{figure}
140\begin{center}
141\begin{minipage}{.8\linewidth}
142\begin{tabbing}
143\quad \= $\downarrow$ \quad \= \kill
144\textsf{C} (unformalized)\\
145\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
146\textsf{Clight}\\
147\> $\downarrow$ \> cast removal\\
148\> $\downarrow$ \> add runtime functions\\
149\> $\downarrow$ \> labelling\\
150\> $\downarrow$ \> stack variable allocation and control structure
151 simplification\\
152\textsf{Cminor}\\
153\> $\downarrow$ \> generate global variable initialisation code\\
154\> $\downarrow$ \> transform to RTL graph\\
155\textsf{RTLabs}\\
156\> $\downarrow$ \> start of target specific backend\\
157\>\quad \vdots
158\end{tabbing}
159\end{minipage}
160\end{center}
161\caption{Front-end languages and transformations}
162\label{fig:summary}
163\end{figure}
164
165The front-end of the compiler is summarised in Figure~\ref{fig:summary}.  The
166two intermediate languages involved are
167\begin{description}
168\item[\textsf{Cminor}] --- a C-like language where local variables are not
169explicitly allocated memory and control structures are simpler
170
171\item[\textsf{RTLabs}] --- a language in the form of a control flow graph
172which retains the values and front end operations from \textsf{Cminor}
173\end{description}
174More details on the formalisation of the syntax and semantics of these
175languages can be found in the accompanying deliverable 3.4.
176Development of the formalized front-end was conducted in concert with the
177development of these intermediate languages to facilitate testing.
178
179\subsection{Revisions to the prototype compiler}
180
181We have been tracking revisions to the prototype compiler during the
182development of the formalized version.  Most of these changes were minor, but
183one exception is a major change to the structure of the compiler.
184
185The original plan for the front-end featured a \textsf{Clight} to
186\textsf{Clight8} phase near the start which replaced all of the integer
187values and operations by 8 bit counterparts, while pointers were split into
188bytes at a later stage.  Experience has shown that it would be difficult to
189produce good code with this approach.  Instead, we now have:
190\begin{itemize}
191\item full size integers, pointers and operations until code selection (the
192first part of the back-end after \textsf{RTLabs}), and
193\item a cast removal stage which simplifies \textsf{Clight} expressions such
194as
195\begin{lstlisting}[language=C,belowskip=0pt]
196    (char)((int)x + (int)y)
197\end{lstlisting}
198into equivalent operations on simpler types, \lstinline'x+y' in this case.
199The cast removal is important because C requires \emph{arithmetic promotion}
200of integer types to (at least) \lstinline'int' before an operation is
201performed.  The \textsf{Clight} semantics do not perform the promotions,
202instead they are added as casts by the CIL-based parser.  However, our targets
203benefit immensely from performing operations on the smallest possible integer
204type, so it is important that we remove promotions where possible.
205\end{itemize}
206
207This document describes the formalized front end after these changes.
208
209\section{\textsf{Clight} phases}
210
211In addition to the conversion to \textsf{Cminor}, there are several
212transformations which act directly on the \textsf{Clight} language.
213
214\subsection{Cast simplification}
215
216We noted above that the arithmetic promotion required by C (and implemented in
217the CIL-based parser) adds numerous casts, causing arithmetic operations to be
218performed on 32 bit integers.  If left alone, the resulting code will be
219larger and slower.  This phase removes many of the casts so that the
220operations can be performed more efficiently.
221
222The prototype version worked by recognising fixed patterns in the
223\textsf{Clight} abstract syntax tree such as
224\[ (t)((t_1)e_1\ op\ (t_2)e_2), \]
225subject to restrictions on the types.  These are replaced with a simpler
226version without the casts.  Such `deep' pattern matching is slightly awkward in
227\matita{} and this approach does not capture compositions of operations, such as
228\begin{lstlisting}[language=C]
229    (char)(((int)a + (int)b) + (int)c)
230\end{lstlisting}
231where \lstinline'a', \lstinline'b' and \lstinline'c' are of type
232\lstinline'char', because the intermediate expression is not cast to and from
233\lstinline'char'.
234
235The formalized version uses a different method, recursively examining each
236expression constructor to see if the expression can be coerced to some
237`desired' type.  For example, when processing the above expression it reaches
238each \lstinline'int' cast with a desired type of \lstinline'char', notes that
239the subexpression is of type \lstinline'char' and eliminates the cast.
240Moreover, when the recursive processing is complete the \lstinline'char' cast
241is also eliminated because its subexpression is already of the correct type.
242
243This has been implemented in \matita.  We have also performed a few proofs that
244the arithmetic behind these changes is correct to gain confidence in the
245technique.  During task 3.4 we will extend these proofs to cover more
246operations and show that the semantics of the expressions are equivalent, not
247just the underlying arithmetic.
248
249\subsection{Labelling}
250
251This phase adds cost labels to the \textsf{Clight} program.  It is a fairly
252simple recursive definition, and was straightforward to port to \matita.  The
253generation of cost labels was handled by our generic identifiers code,
254described in the accompanying deliverable 3.3 on intermediate languages.
255
256\subsection{Runtime functions}
257% TODO: this hasn't been implemented yet
258
259\subsection{Conversion to Cminor}
260
261The conversion to \textsf{Cminor} performs two essential tasks.  First, it
262determines which local variables need to be stored in memory and generates
263explicit memory accesses for them.  Second, it must translate the control
264structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s
265more basic structures.
266
267These are both performed by code similar to that in the prototype, although
268the use of generic fold operations on statements and expressions has been
269replaced by simpler recursive definitions.
270
271There are two additional pieces of work that the formalized translation must
272do.  The \textsf{Cminor} definition features some mild constraints of the
273types of expressions, which we can enforce in the translation using some type
274checking.  The error monad is used to dispose of ill-typed \textsf{Clight}
275programs.
276
277The other difficulty is that we need to generate fresh temporary variables to
278store function results in before they are written to memory.  This is
279necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions
280as the destination for the returned value, but \textsf{Cminor} only allows
281local variables.  All other variable names in the \textsf{Cminor} program came
282from the \textsf{Clight} program, but we need to construct a method for
283generating fresh names for the temporaries.
284
285Our identifiers are based on binary numbers, and generation of fresh names is
286handled by keeping track of the highest allocated number.  Normally this is
287initialised at zero, but if initialised by the largest existing identifier in
288the \textsf{Clight} program then the generated names will be fresh.
289To do this, we extract the maximum identifier by recursively finding the maximum
290variable name used in every expression, statement and function of the program.
291
292\section{\textsf{Cminor} phases}
293
294\textsf{Cminor} programs are processed by two passes: one deals with the
295initialisation of global variables, and the other produces \textsf{RTLabs}
296code.
297
298\subsection{Initialisation code}
299
300This replaces the initialisation data with explicit code in the main function.
301The only remarkable point in the formalization is that we have two slightly
302different instantiations of the \textsf{Cminor} syntax: one with
303initialisation data that this pass takes as input, and one with only size
304information that is the output.  In addition to reflecting the purpose of this
305pass in its type, it also ensures that the pass cannot be accidentally omitted.
306
307\subsection{Conversion to \textsf{RTLabs}}
308
309This pass breaks down the structure of the \textsf{Cminor} program into a
310control flow graph, but maintains the same set of operations.  The algorithm
311is stateful in the sense that it builds up the \textsf{RTLabs} function body
312incrementally, but all of the relevant state is already present in the
313function record (including the fresh register and graph label name generators)
314and the prototype passes this around.  Thus the formalized code is very
315similar in nature.
316
317One possible variation would be to explicitly define a state monad to carry
318the function under construction around, but it is not yet clear if this will
319make the correctness results easier to prove.
320
321\section{Adding and using invariants}
322
323The compiler phases described above all use the error monad to deal with
324inconsistencies in the program being transformed.  In particular, lookups in
325environments may fail, control flow graphs may have missing statements and
326various structural problems may be present.  We would like to show that these
327failures are absent where possible by establishing that programs are well
328formed early in the compilation process.
329
330This work overlaps with deliverable 3.3 (where more details of the additions
331to the syntax and semantics of the intermediate languages can be found) and
332task 3.4 on the correctness of the compiler.  Thus this work is experimental
333in nature, and will evolve during task 3.4.
334
335The use of the invariants follows a common pattern.  Each language embeds
336invariants in the function record that constrain the function body by other
337information in the record (such as the list of local variables and types, or
338the set of labels declared).  However, during the transformations they
339typically need to be refined to constraints on individual statements and
340expressions with respect to data structures used in the transformation.
341A similar change in invariants is required between the transformation and the
342new function.
343
344For example, consider the use of local variables in the \textsf{Cminor} to
345\textsf{RTLabs} stage.  We start with
346\begin{lstlisting}[language=matita]
347record internal_function : Type[0] ≝
348{ f_return    : option typ
349; f_params    : list (ident × typ)
350; f_vars      : list (ident × typ)
351; f_stacksize : nat
352; f_body      : stmt
353; f_inv       : stmt_P (λs.stmt_vars (λi.Exists ? (λx.\fst x = i) (f_params @ f_vars)) s ∧
354                           stmt_labels (λl.Exists ? (λl'.l' = l) (labels_of f_body)) s) f_body
355}.
356\end{lstlisting}
357where the first half of \lstinline'f_inv' requires every variable in the
358function body to appear in the parameter or variable list.  In the translation
359to \textsf{RTLabs}, variable lookups are performed in a map to \textsf{RTLabs}
360pseudo-registers:
361\begin{lstlisting}[language=matita]
362definition env ≝ identifier_map SymbolTag register.
363
364let rec add_expr (le:label_env) (env:env) (ty:typ) (e:expr ty)
365                   (Env:expr_vars ty e (present ?? env))
366                   (dst:register) (f:partial_fn le) on e
367     : Σf':partial_fn le. fn_graph_included le f f' ≝
368match e return λty,e.expr_vars ty e (present ?? env) →
369                 Σf':partial_fn le. fn_graph_included le f f' with
370[ Id _ i ⇒ λEnv.
371    let r ≝ lookup_reg env i Env in
372    ...
373\end{lstlisting}
374Note that \lstinline'lookup_reg' returns a register without any possibility of
375error.  The reason this works is that the pattern match on \lstinline'e'
376refines the type of the invariant \lstinline'Env' to a proof that the variable
377\lstinline'i' is present.  We then pass this proof to the lookup function to
378rule out failure.
379
380When this map \lstinline'env' is constructed at the start of the phase, we
381prove that the proof \lstinline'f_inv' from the function implies the invariant
382on variables needed by \lstinline'add_expr' and its equivalent on statements:
383\begin{lstlisting}[language=matita]
384lemma populates_env : ∀l,e,u,l',e',u'.
385  populate_env e u l = 〈l',e',u'〉 →
386  ∀i. Exists ? (λx.\fst x = i) l →
387      present ?? e' i.
388\end{lstlisting}
389A similar mechanism is used to show that \texttt{goto} labels are always
390declared in the function.
391
392Also note the return type of \lstinline'add_expr' is a dependent pair.  We
393build the resulting \textsf{RTLabs} function incrementally, using a type
394\lstinline'partial_fn' that does not contain the final invariant for
395functions.  We always require the \lstinline'fn_graph_included' property for
396partially built functions to show that the graph only gets larger, a key part
397of the proof that the resulting control flow graph is closed.  Dependent pairs
398are used in a similar manner in the \textsf{Clight} to \textsf{Cminor} phase
399too.
400
401This work does not currently cover all of the possible sources of failure; in
402particular some structural constraints on functions are not yet covered and
403some properties of \textsf{RTLabs} programs that may be useful for later
404stages or the correctness proofs are not produced.  Moreover, we may
405experiment with variations to try to make the proof obligations and syntax
406simpler to deal with.  However, it does show that retrofitting these
407properties using dependent types in \matita{} is feasible.
408
409\section{Testing}
410
411To provide some early testing and bug fixing of this code we constructed it in
412concert with the executable semantics described in deliverable 3.3, and
413\matita{} term pretty printers in the prototype compiler.  Using these, we
414were able to test the phases individually and together by running programs
415within the proof assistant itself, and comparing the results with the expected
416output.
417
418\section{Conclusion}
419
420\bibliographystyle{plain}
421\bibliography{report}
422
423\end{document}
Note: See TracBrowser for help on using the repository browser.