# source:Deliverables/D3.3/Report/report.tex@1317

Last change on this file since 1317 was 1317, checked in by campbell, 10 years ago

Fix listings in D3.2/D3.3.

File size: 21.8 KB
Line
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.3\\
70Executable Formal Semantics of front end intermediate languages\\
71\end{LARGE}
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We report on the formalization of the front end intermediate languages for the
102\cerco{} project's compiler using executable semantics in the \matita{} proof
103assistant.  This includes common definitions for fresh identifier handling,
104$n$-bit arithmetic and operations and testing of the smeantics.  We also
105consider embedding invariants into the semantics for showing correctness
106properties.
107
108\newpage
109
110\tableofcontents
111
112% TODO: clear up any -ize vs -ise
113% TODO: clear up any "front end" vs "front-end"
114% TODO: clear up any mentions of languages that aren't textsf'd.
115% TODO: fix unicode in listings
116
117\section{Introduction}
118
119This work formalizes the front end intermediate languages from the \cerco{}
120prototype compiler, described in previous deliverables~\cite{d2.1,d2.2}.  The
121front end of the compiler is summarized in Figure~\ref{fig:summary} including
122the intermediate languages and the compiler passes described in the
123accompanying deliverable 3.2.  We have also refined parts of the formal
124development that are used for several of the languages in the compiler.
125
126The input language to the formalized front end is the \textsf{Clight}
127language.  The executable semantics for this language were presented in a
128previous deliverable~\cite{d3.1}.  Here we will report on some
129minor changes to its semantics made to better align it with the whole
130development.
131
132The formalization of each language takes the form of definitions for abstract
133syntax and functions providing a small-step executable semantics.  This is
134done in the Calculus of Inductive Constructions (CIC), as implemented in the
135\matita{} proof assistant.  These definitions will be essential for the
136correctness proofs of the compiler in task 3.4.
137
138Finally, we will report on work to add several invariants to the
139languages.  This activity overlaps with task 3.4 on the correctness of the
140compiler front end.  However, the use of dependent types mean that this work
141is tied closely to the definition of the languages and the transformations of
142the front end in task 3.2.  By considering it now we can experiment with and
143judge its impact on the formal semantics, and how feasible retrofitting such
144invariants is.
145
146\begin{figure}
147\begin{center}
148\begin{minipage}{.8\linewidth}
149\begin{tabbing}
150\quad \= $\downarrow$ \quad \= \kill
151\textsf{C} (unformalized)\\
152\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
153\textsf{Clight}\\
154\> $\downarrow$ \> cast removal\\
155\> $\downarrow$ \> add runtime functions\\
156\> $\downarrow$ \> labelling\\
157\> $\downarrow$ \> stack variable allocation and control structure
158 simplification\\
159\textsf{Cminor}\\
160\> $\downarrow$ \> generate global variable initialisation code\\
161\> $\downarrow$ \> transform to RTL graph\\
162\textsf{RTLabs}\\
163\> $\downarrow$ \> start of target specific backend\\
165\end{tabbing}
166\end{minipage}
167\end{center}
168\caption{Front end languages and transformations}
169\label{fig:summary}
170\end{figure}
171
172\subsection{Revisions to the prototype compiler}
173
174Ongoing work to maintain and improve the prototype compiler has resulted in
175several changes, mostly minor.  The most important change is that the
176transformations to replace 16 and 32 bit types have been moved from the
177\textsf{Clight} language to the target-specific stage between \textsf{RTLabs}
178and \textsf{RTL} to help generate better code, and the addition of a
179\textsf{Clight} cast removal transformation to reduce the number of 16 and 32
180bit operations.
181
182The formalized semantics have tracked these changes, and in this report we
183describe them as they currently stand.
184
185\section{Definitions common to several languages}
186
187The semantics for many of the languages in the compiler share some core parts:
188the memory model, environments, runtime values, definitions of operations and
189a monad for encapsulating failure and I/O (using resumptions).  The executable
190memory model was ported from CompCert as part of the work on \textsf{Clight}
191and was reused for the front end languages\footnote{However, it is likely that
192we will revise the memory model to make it better suited for describing the
193back-end of our compiler.}.  The failure and I/O monad was introduced in the
194previous deliverable on \textsf{Clight}~\cite[\S4.2]{d3.1}.  In all of the
195languages except \textsf{Clight} we have a basic form of type, \lstinline'typ'
196identifying integers and pointers along with their sizes.  The other parts
197are discussed below, with the only change to the runtime values being the
198representation of integers.
199
200\subsection{Identifiers}
201
202Each language features one or more kinds of names to represent variables, such
203as registers, \texttt{goto} labels or RTL graph labels.  We also need to
204describe various maps whose domain is a set of identifiers when defining the
205semantics and compilation.
206
207Previous work on the executable semantics of the target machine
208code included bit vectors and bit vector tries to define various integers in
209the semantics, and give a low level view of memory~\cite{d4.1}.  To keep the
210size of the development down we have reused these data structures for
211identifiers and maps respectively.
212
213One difficulty with using fixed size bit vectors for identifiers is that
214fresh name generation can fail if generate too many.  While we use an error
215monad to deal with failures, we wish to minimize its use in the compiler.
216Thus we add a flag to detect overflows, and check it after the phase of the
217compiler is complete to report exhaustion.  The rest of the phase can then be
218written as if name generation always succeeds.  In practice this will never
219occur on normal programs because more identifiers of each sort are available
220than bytes of code memory on the target.
221
222Given the wide variety of identifiers used in the compiler we also
223wish to separate the different classes of identifier.  Thus we encapsulate the
224bit vector representing the identifier in a datatype that also carries a tag
225identifying which class of identifier we are using:
226\begin{lstlisting}
227    inductive identifier (tag:String) : Type[0] ≝
228      an_identifier : Word $\rightarrow$ identifier tag.
229\end{lstlisting}
230The tries are also tagged in the same manner.  These tags have also proved
231useful during testing by making the resulting terms more readable.
232
233\subsection{Machine integers and arithmetic}
234
235The bit vectors in~\cite{d4.1} also came equipped with some basic arithmetic
236for the target semantics.  The front end required these operations to be
237generalized and extended.  In particular, we required operations such as zero
238and sign extension and translation between bit vectors and full integers.  It
239also became apparent that while the original definitions worked reasonably on
2408-bit vectors, they did not scale up to 32-bit integers.  The definitions were
241then reworked to make them efficient enough to animate programs in the
242front-end semantics.
243
244\subsection{Front end operations}
245
246The two front end intermediate languages, \textsf{Cminor} and \textsf{RTLabs},
247share the same set of operations on values.  They differ from
248\textsf{Clight}'s operations by incorporating casts and by having a separate
249operation for each type of data operated upon.  For example, subtraction of
250pointers is treated as a different operation from subtraction of integers.
251
252A common semantics is given for these operations in the form of simple
253CIC functions on the operation and runtime values.
254
255\section{\textsf{Clight} modifications}
256
257The \textsf{Clight} input language remained largely the same as in the
258previous deliverable~\cite{d3.1}.  The principal changes were to use the
259identifiers and arithmetic described above in place of the arbitrarily large
260integers used before.  For the identifiers, this relieved us of the burden of
261adding an efficient datatype for maps by reusing the bit vector tries instead.
262
263The arithmetic replaced a dependent pair of an arbitrary integer and a proof
264that it was in range of 32 bit integers by the exact bit vector for each
265size of integer.  This direct approach is closer to the implementation and
266more obviously correct --- no extra precision can be left in by accident.
267
268\section{\textsf{Cminor}}
269
270The \textsf{Cminor} language does not store local variables in memory, and has
271simpler control structures than \textsf{Clight}.  It is similar in nature to
272the \textsf{Cminor} language in CompCert, although the semantics have been
273based on the \cerco{} prototype rather than ported from CompCert.  The syntax
274is similar to the prototype, except that the types attached to expressions are
275restricted so that some corner cases are ruled out in the \textsf{Cminor} to
276\textsf{RTLabs} stage (see the accompanying deliverable 3.2 for details):
277\begin{lstlisting}
278    inductive expr : typ $\rightarrow$ Type[0] ≝
279    | Id : $\forall$t. ident $\rightarrow$ expr t
280    | Cst : $\forall$t. constant $\rightarrow$ expr t
281    | Op1 : $\forall$t,t'. unary_operation $\rightarrow$ expr t $\rightarrow$ expr t'
282    | Op2 : $\forall$t1,t2,t'. binary_operation $\rightarrow$ expr t1 $\rightarrow$ expr t2 $\rightarrow$ expr t'
283    | Mem : $\forall$t,r. memory_chunk $\rightarrow$ expr (ASTptr r) $\rightarrow$ expr t
284    | Cond : $\forall$sz,sg,t. expr (ASTint sz sg) $\rightarrow$ expr t $\rightarrow$ expr t $\rightarrow$ expr t
285    | Ecost : $\forall$t. costlabel $\rightarrow$ expr t $\rightarrow$ expr t.
286\end{lstlisting}
287For example, note that conditional expressions only switch on integer
288expressions.
289In principle we could extend this to statically ensure that only well-typed
290\textsf{Cminor} expressions are considered, and we will consider this as part
291of the work on correctness in task 3.4.
292
293We also provide a variant of the syntax where the only initialization data is
294the size of each global variable, for use after the initialization code has
295been generated.
296
297The definition of the semantics is routine: a functional definition of a
298single small step of the machine is given, reusing the memory model,
299environments, arithmetic and operations mentioned above.
300
301\section{\textsf{RTLabs}}
302
303The \textsf{RTLabs} language provides a target independent Register Transfer
304Language, where programs are represented as control flow graphs.  We use the
305identifiers described above for the graph labels and the maps for the graph
306itself.  The tagging mechanism ensures that labels cannot be mixed up with
307other identifiers in the program (in particular, we cannot accidentally reuse
308a \texttt{goto} label from Cminor where a graph label should appear).
309
310Otherwise, the syntax and semantics of \textsf{RTLabs} mirrors that of the
311prototype compiler.  Some of the syntax is shown below, including the type of
312the control flow graphs.  The same common elements are used as for \textsf{Cminor},
313including the front end operations mentioned above.
314
315\begin{lstlisting}[language=matita]
316inductive statement : Type[0] ≝
317| St_skip : label $\rightarrow$ statement
318| St_cost : costlabel $\rightarrow$ label $\rightarrow$ statement
319| St_const : register $\rightarrow$ constant $\rightarrow$ label $\rightarrow$ statement
320| St_op1 : unary_operation $\rightarrow$ register $\rightarrow$ register $\rightarrow$ label $\rightarrow$ statement
321...
322| St_return : statement
323.
324
325definition graph : Type[0] $\rightarrow$ Type[0] ≝ identifier_map LabelTag.
326
327record internal_function : Type[0] ≝
328{ f_labgen    : universe LabelTag
329; f_reggen    : universe RegisterTag
330; f_result    : option (register $\times$ typ)
331; f_params    : list (register $\times$ typ)
332; f_locals    : list (register $\times$ typ)
333; f_stacksize : nat
334; f_graph     : graph statement
335}.
336\end{lstlisting}
337
338\section{Testing}
339
340To provide some assurance that the semantics were properly implemented, and to
341support the testing described in the accompanying deliverable 3.2, we have
342adapted the pretty printers in the prototype compiler to produce \matita{}
343terms for the syntax of each language described above.
344
345A few common definitions were added for animating the small step semantics
346definitions for any of the front end languages in \matita.  We then used a
347small selection of test cases to ensure basic functionality.  However, this is
348still a time consuming process, so more testing will carried out once the
349extraction of CIC terms to \ocaml{} programs is implemented in \matita.
350
351\section{Embedding invariants}
352
353Each phase of the prototype compiler can fail in a number of places if the
354input language permits programs that are badly structured in some sense: a
355missing label in a \texttt{goto} statement or CFG, an undefined variable name,
356a \texttt{break} statement outside of a loop or \texttt{switch}, and so on.
357We wish to restrict our intermediate languages using dependent types to remove
358as many `junk' programs as possible to rule out such failures.  We also hope
359that such restrictions will help in other correctness proofs.
360
361This goal lies in the overlap between several tasks in the project: it
362involves manipulating the syntax and semantics of the intermediate languages
363(the present work), the encoding of the front-end compiler phases in \matita{}
364(task 3.2) and the correctness of the front-end (task 3.4).  Thus this work is
365rather experimental; it is being carried out on branches in our source code
366repository and the final form will be decided and merged in during task 3.4.
367
368So far we have tried adding two forms of invariant --- one using dependent
369types to index statements in \textsf{Cminor} by their block depth, and the
370other asserts that variables and labels are present in the appropriate
371environments by adding a separate invariant to each function.  Note that these
372do not yet cover all of the properties that a program in these languages is
373expected to enjoy; for example, there are currently no checks that references
374to globals are well-defined.
375
376\subsection{\textsf{Cminor} block depth}
377
378The \textsf{Cminor} language has relatively simple control structures.
379Statements are provided for infinite loops, non-looping blocks and
380exiting an arbitrary number of blocks (for \texttt{break}, failing loop guards
381and the switch statement\footnote{We are considering replacing the
382\textsf{Cminor} switch statement with one that uses \texttt{goto}-like labels
383in both the prototype and the formalized compilers, but for now we stick with
384this CompCert-style arrangement.}).
385
386However, this means that there are badly-formed \textsf{Cminor} programs such
387as
388\begin{lstlisting}[language={}]
389    int main() {
390      block {
391        loop {
392          exit 5
393        }
394      }
395    }
396\end{lstlisting}
397where we attempt to exit more blocks than exist.  To rule these out (including
398demonstrating that the previous phase of the compiler does not generate them)
399we can index the statements of the language by the depth of the enclosing
400blocks.
401
402The adaption of the syntax adds the depth to every statement, and uses bounded
403integers in the exit and switch statements:
404\begin{lstlisting}[language=matita]
405    inductive stmt : $\forall$blockdepth:nat. Type[0] ≝
406    | St_skip : $\forall$n. stmt n
407    ...
408    | St_loop : $\forall$n. stmt n $\rightarrow$ stmt n
409    | St_block : $\forall$n. stmt (S n) $\rightarrow$ stmt n
410    | St_exit : $\forall$n. Fin n $\rightarrow$ stmt n
411    (* expr to switch on, table of <switch value, #blocks to exit>, default *)
412    | St_switch : expr $\rightarrow$ $\forall$n. list (int $\times$ (Fin n)) $\rightarrow$ Fin n $\rightarrow$ stmt n
413    ...
414\end{lstlisting}
415where \texttt{stmt n} is a statement enclosed in \texttt{n} blocks,
416and \texttt{Fin n} is a standard construction for a natural number which is at
417most \texttt{n}.
418
419In the semantics the number of blocks is also added to the continuations and
420state, and the function to find the continuation from an exit statement can be
422We note in passing that adding this parameter detected a small mistake in the
423semantics concerning continuations and tail calls, although the mistake itself
424was benign.
425
426\subsection{Identifier invariants}
427
428To show that the variables and labels occurring in the body of a function
429are present in the relevant structures we add an additional invariant to the
430function records.
431
432For \textsf{Cminor} we use a higher-order predicate which recursively applies
433a predicate to all substatements:
434\begin{lstlisting}[language=matita]
435let rec stmt_P (P:stmt $\rightarrow$ Prop) (s:stmt) on s : Prop ≝
436match s with
437[ St_seq s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
438| St_ifthenelse _ _ _ s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
439| St_loop s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
440| St_block s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
441| St_label _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
442| St_cost _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
443| _ $\Rightarrow$ P s
444].
445\end{lstlisting}
446Dependent pattern matching on statements thus allows an accompanying
447\lstinline'stmt_P' fact to be unfold to the predicate on the current statement
448and the predicate applied to all substatements.
449
450We require two properties to hold in \textsf{Cminor} functions:
451\begin{enumerate}
452\item All variables in the body are present in the list of parameters or the
453list of variables for the function (this also uses a similar recursive
454predicate on expressions).
455\item All labels in \texttt{goto} statements appear in a label statement.
456\end{enumerate}
457The function definition thus becomes:
458\begin{lstlisting}[language=matita]
459record internal_function : Type[0] ≝
460{ f_return    : option typ
461; f_params    : list (ident $\times$ typ)
462; f_vars      : list (ident $\times$ typ)
463; f_stacksize : nat
464; f_body      : stmt
465; f_inv       : stmt_P ($\lambda$s.stmt_vars ($\lambda$i.Exists ? ($\lambda$x.\fst x = i) (f_params @ f_vars)) s $\wedge$
466                           stmt_labels ($\lambda$l.Exists ? ($\lambda$l'.l' = l) (labels_of f_body)) s) f_body
467}.
468\end{lstlisting}
469where \lstinline'stmt_vars' and \lstinline'stmt_labels' constrain the
470variables and labels that appear directly in a statement (but not
471substatements) to appear in the given list, and
472\lstinline'labels_of' returns a list of all the labels defined in a
473statement.
474
475The \textsf{Clight} semantics can be amended to use these invariants, although
476the main benefit is for the compiler stages (see the accompanying deliverable
4773.2 for details).  The semantics require the invariants to be added to the
478state and continuations.  It was convenient to split the continuations between
479the local continuation representing the rest of the code to be executed within
480the function, and the stack of function calls because it becomes easier to
481state the property on the local continuation alone.
482  The invariant for variables is
483slightly different --- we require that every variable appear in the local
484environment.  We use \lstinline'f_inv' from the function to establish this
485invariant when the environment is set up on function entry.
486
487It is unclear whether changing the semantics is really worthwhile.  It
488witnesses that the invariants are those we wanted, but makes no difference to
489the actual execution of the program, especially as the execution can still
490fail due to genuine runtime errors.  Moreover, it is unclear what effect the
491presence of proof terms and more dependent pattern matching in the semantics
492will have on the complexity of future correctness proofs.  We plan to examine
494
495We use a similar method to specify the invariant that the \textsf{RTLabs}
496graph is closed --- that is, any successor labels in a statement in the graph
497are present in the graph.  The definition is simpler in \textsf{RTLabs}
498because the flat representation of the graph does not require recursive
499definitions like \lstinline'stmt_P' above.
500
501\section{Conclusion}
502
503We have developed executable semantics for each of the front-end languages of
504the \cerco{} compiler.  These will form the basis of the correctness
505statements for each stage of the compiler in task 3.4.  We have also shown
506that useful invariants can be added as dependent types, and intend to use
507these in subsequent work.
508
509\bibliographystyle{plain}
510\bibliography{report}
511
512\end{document}
Note: See TracBrowser for help on using the repository browser.