source: Deliverables/D3.2/Report/report.tex @ 1235

Last change on this file since 1235 was 1235, checked in by campbell, 8 years ago

Some basic material for D3.2, but not yet finished.

File size: 13.5 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.2\\
70CIC encoding: Front-end\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We describe the translation of the front end of the \cerco{} compiler from the
102\ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the
103\matita{} proof assistant.  This transforms programs in the C-like
104\textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably
105target-independent and in the form of a control flow graph.
106
107We also report on progress enriching these transformations with dependent
108types so as to establish key invariants in each intermediate language, which
109will assist in future work proving correctness properties of the compiler.
110
111\newpage 
112
113\tableofcontents
114
115% TODO: clear up any -ize vs -ise
116% TODO: clear up any "front end" vs "front-end"
117% TODO: clear up any mentions of languages that aren't textsf'd.
118% TODO: fix unicode in listings
119% TODO: make it clear that our correctness is intended to go beyond compcert's
120
121\section{Introduction}
122
123The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but
124the certified compiler will be a program written in the Calculus of Inductive
125Constructions (CIC), as realised by the \matita{} proof assistant.  This
126deliverable reports on the translation of the front-end of the compiler into
127CIC and the subsequent efforts to start exploiting dependent types to maintain
128invariants and rule out potential sources of failure in the compiler.
129
130The input language for the formalized compiler is the \textsf{Clight}
131language.  This is a C-like language with side-effect free expressions that
132was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will
133also use their CIL-based C parser to generate \textsf{Clight} abstract syntax
134trees, but will not formalize this code.} and provided with an executable
135semantics.  See~\cite{d3.1} for more details on the syntax and semantics.
136
137\begin{figure}
138\begin{tabbing}
139\quad \= $\downarrow$ \quad \= \kill
140\textsf{C} (unformalized)\\
141\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
142\textsf{Clight}\\
143\> $\downarrow$ \> cast removal\\
144\> $\downarrow$ \> add runtime functions\\
145\> $\downarrow$ \> labelling\\
146\> $\downarrow$ \> stack variable allocation and control structure
147 simplification\\
148\textsf{Cminor}\\
149\> $\downarrow$ \> generate global variable initialisation code\\
150\> $\downarrow$ \> transform to RTL graph\\
151\textsf{RTLabs}\\
152\> $\downarrow$ \> start of target specific backend\\
153\ \  \dots
154\end{tabbing}
155\caption{Front-end languages and transformations}
156\label{fig:summary}
157\end{figure}
158
159The front-end of the compiler is summarised in Figure~\ref{fig:summary}.  The
160two intermediate languages involved are
161\begin{description}
162\item[\textsf{Cminor}] --- a C-like language where local variables are no
163longer held explicitly in memory and control structures are simpler
164
165\item[\textsf{RTLabs}] --- a language in the form of a control flow graph but
166retaining the front end operations from \textsf{Cminor}
167\end{description}
168More details on the formalisation of the syntax and semantics of these
169languages can be found in the accompanying deliverable 3.4.
170Development of the formalized front-end was conducted in concert with the
171development of these intermediate languages to facilitate testing.
172
173\subsection{Revisions to the prototype compiler}
174
175We have been tracking revisions to the prototype compiler during the
176development of the formalized version.  These were usually minor, with the
177exception of the following major revision to the compiler.
178
179The original plan for the front-end featured a \textsf{Clight} to
180\textsf{Clight8} phase near the start which replaced all of the integer
181values and operations by 8 bit counterparts, while pointers were split into
182bytes at a later stage.  Experimentation found that it would be difficult to
183produce good code with this approach.  Instead, we now have:
184\begin{itemize}
185\item full size integers, pointers and operations until code selection (the
186first part of the back-end after \textsf{RTLabs}), and
187\item a cast removal stage which simplifies \textsf{Clight} expressions such
188as
189\begin{lstlisting}[language=C]
190  (char)((int)x + (int)y)
191\end{lstlisting}
192which are produced by integer promotion built into the CIL parser into
193equivalent operations on simpler types, \lstinline'x+y' in this case.
194\end{itemize}
195
196This document describes the formalized front end after these changes.
197
198\section{Clight phases}
199
200In addition to the conversion to \textsf{Cminor}, there are several
201transformations that act directly on the \textsf{Clight} language.
202
203\subsection{Cast simplification}
204
205We noted above that the arithmetic promotion required by C (and implemented in
206the CIL-based parser) adds numerous casts, causing arithmetic operations to be
207performed on 32 bit integers.  If left alone, the resulting code will be
208larger and slower.  This phase removes many of the casts so that the
209operations can be performed more efficiently.
210
211The prototype version worked by recognising fixed patterns in the
212\textsf{Clight} abstract syntax tree such as
213\[ (t)((t_1)e_1 op (t_2)e_2) \]
214subject to restrictions on the types, and replaces them with a simpler
215version.  These deep pattern matches are slightly awkward in \matita{} and
216they do not capture compositions of operations, such as
217\begin{lstlisting}[language=C]
218(char)((int)a + (int)b + (int)c)
219\end{lstlisting}
220where \lstinline'a', \lstinline'b' and \lstinline'c' are of type
221\lstinline'char'.
222
223The formalized version uses a different method, recursively examining each
224expression constructor to see if the expression can be coerced to some
225`desired' type.  For example, when processing the above expression it reaches
226each \lstinline'int' cast with a desired type of \lstinline'char', notes that
227the subexpression is of type \lstinline'char' and eliminates the cast.
228Moreover, when the recursive processing is complete the \lstinline'char' cast
229is also eliminated because its subexpression is now of the correct type.
230
231This has been formalized in \matita.  We have also performed a few proofs that
232the arithmetic behind these changes is correct to gain confidence in the
233technique.  During task 3.4 we will extend these proofs to cover more
234operations and show that the semantics of the expressions are equivalent, not
235just the underlying arithmetic.
236
237\subsection{Labelling}
238
239This phase adds cost labels to the \textsf{Clight} program.  It is a fairly
240simple recursive definition, and was straightforward to port to \matita.  The
241generation of cost labels was handled by our generic identifiers code,
242described in the accompanying deliverable 3.3 on intermediate languages.
243
244\subsection{Runtime functions}
245% TODO: this hasn't been implemented yet
246
247\subsection{Conversion to Cminor}
248
249The conversion to \textsf{Cminor} performs two essential tasks.  First, it
250determines which local variables need to be stored in memory and generates
251explicit memory accesses for them.  Second, it must translate the control
252structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s
253more basic structures.
254
255These are both performed by code similar to that in the prototype, although
256the use of generic fold operations on statements and expressions has been
257replaced by simpler recursive definitions.
258
259There are two additional pieces of work that the formalized translation must
260do.  The \textsf{Cminor} definition features some mild constraints of the
261types of expressions, which we can enforce in the translation using some type
262checking.  The error monad is used to dispose of ill-typed \textsf{Clight}
263programs.
264
265The other difficulty is that we need to generate fresh temporary variables to
266store function results in before they are written to memory.  This is
267necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions
268as the destination for the returned value, but \textsf{Cminor} only allows
269local variables.  However, the variable names are supplied with the input
270program, but without any method for generating fresh names.
271
272Our identifiers are based on binary numbers, and generation of fresh names is
273handled by keeping track of the highest allocated number.  Thus we create a
274new name generator from the input program by finding the maximum number used.
275
276\section{Cminor phases}
277
278Only two phases deal with \textsf{Cminor} programs:
279
280\subsection{Initialisation code}
281
282This replaces the initialisation data with explicit code in the main function.
283The only remarkable point in the formalization is that we have two slightly
284different instantiations of the \textsf{Cminor} syntax: one with
285initialisation data that this pass takes as input, and one with only size
286information that is the output.  In addition to reflecting the purpose of this
287pass in the types, it also ensures that it cannot be accidentally left out.
288
289\subsection{Conversion to RTLabs}
290
291This pass breaks down the structure of the \textsf{Cminor} program into a
292control flow graph, but maintains the same set of operations.  The algorithm
293is stateful in the sense that it builds up the \textsf{RTLabs} function body
294incrementally, but all of the relevant state is already present in the
295function record (including the fresh register and graph label name generators)
296and the prototype passes this around.  Thus the formalized code is very
297similar in nature.
298
299One possible variation would be to explicitly define a state monad to carry
300the function under construction around, but it is not yet clear if this will
301make the correctness easier to prove.
302
303\section{Adding and using invariants}
304
305The compiler phases described above all use the error monad to deal with
306inconsistencies in the program being transformed.  In particular, lookups in
307environments may fail, control flow graphs may have missing statements and
308various structural problems may be present.  We would like to show that these
309failures are absent where possible by establishing that programs are well
310formed early in the compilation process.
311
312This work overlaps with deliverable 3.3 (where more details of the additions
313to the syntax and semantics of the intermediate languages can be found) and
314task 3.4 on the correctness of the compiler.  Thus this work is experimental
315in nature, and will evolve during task 3.4.
316
317Invariants on \textsf{Cminor} programs are defined using a higher order
318predicate which applies a given predicate to a statement and recursively to
319each of its substatements (and expressions and subexpressions, respectively).
320
321Thus during the \textsf{Clight} to \textsf{Cminor} stage the values returned
322are not just \textsf{Cminor} expressions and statements, but dependent pairs
323that also return invariants to establish that only local variables appear in
324the generated terms, and all labels appearing in \texttt{goto} statements are
325defined in the result.  The latter is proved by showing two properties: first
326by checking whether the labels are in the set defined by the original
327\textsf{Clight} function body, and second that the translation preserves the
328set of label statements.
329
330These two properties are used at the end of translation to show the invariant
331attached to \textsf{Cminor} \emph{functions}: that all \texttt{goto} labels
332are defined in the body.  Similarly, the invariant for variables is slightly
333different to the translation's: we go from the fact that every variable was
334classified as local to the appearance of every variable in the locals or
335parameters.  This is done using a lemma showing that the classification only
336allows parameters and locals to be classed as `local'.
337
338The next translation to \textsf{RTLabs}\dots
339
340\section{Conclusion}
341
342\bibliographystyle{plain}
343\bibliography{report}
344
345\end{document}
Note: See TracBrowser for help on using the repository browser.