source: Deliverables/D3.3/Report/report.tex @ 1346

Last change on this file since 1346 was 1346, checked in by campbell, 9 years ago

Minor corrections and a paragraph of context in the abstract to deliverables 3.2 and 3.3.

File size: 22.2 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.3\\
70Executable Formal Semantics of front-end intermediate languages\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We report on the formalization of the front-end intermediate languages for the
102\cerco{} project's compiler using executable semantics in the \matita{} proof
103assistant.  This includes common definitions for fresh identifier handling,
104$n$-bit arithmetic and operations and testing of the semantics.  We also
105consider embedding invariants into the semantics for showing correctness
106properties.
107
108This work was Task~3.3 of the \cerco{} project and the languages that were
109formalized were first described in Deliverable~2.1~\cite{d2.1}.  They are used
110by the formalized front-end in Task~3.2 to provide the syntax for the
111transformations, and to test them by animating programs in the executable
112semantics.  It will also be a crucial part of the specifications for the
113correctness results in Task~3.4.
114
115\newpage 
116
117\tableofcontents
118
119% CHECK: clear up any -ize vs -ise
120% CHECK: clear up any "front end" vs "front-end"
121% CHECK: clear up any mentions of languages that aren't textsf'd.
122% CHECK: fix unicode in listings
123
124\section{Introduction}
125
126This work formalizes the front-end intermediate languages from the \cerco{}
127prototype compiler, described in previous deliverables~\cite{d2.1,d2.2}.  The
128front-end of the compiler is summarized in Figure~\ref{fig:summary} including
129the intermediate languages and the compiler passes described in the
130accompanying Deliverable 3.2.  We have also refined parts of the formal
131development that are used for several of the languages in the compiler.
132
133The input language to the formalized front-end is the \textsf{Clight}
134language.  The executable semantics for this language were presented in a
135previous deliverable~\cite{d3.1}.  Here we will report on some
136minor changes to its semantics made to better align it with the whole
137development.
138
139The formalization of each language takes the form of definitions for abstract
140syntax and functions providing a small-step executable semantics.  This is
141done in the Calculus of Inductive Constructions (CIC), as implemented in the
142\matita{} proof assistant.  These definitions will be essential for the
143correctness proofs of the compiler in Task 3.4.
144
145Finally, we will report on work to add several invariants to the
146languages.  This activity overlaps with Task 3.4 on the correctness of the
147compiler front-end.  However, the use of dependent types mean that this work
148is tied closely to the definition of the languages and the transformations of
149the front-end in Task 3.2.  By considering it now we can experiment with and
150judge its impact on the formal semantics, and how feasible retrofitting such
151invariants is.
152
153\begin{figure}
154\begin{center}
155\begin{minipage}{.8\linewidth}
156\begin{tabbing}
157\quad \= $\downarrow$ \quad \= \kill
158\textsf{C} (unformalized)\\
159\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
160\textsf{Clight}\\
161\> $\downarrow$ \> cast removal\\
162\> $\downarrow$ \> add runtime functions\\
163\> $\downarrow$ \> labelling\\
164\> $\downarrow$ \> stack variable allocation and control structure
165 simplification\\
166\textsf{Cminor}\\
167\> $\downarrow$ \> generate global variable initialization code\\
168\> $\downarrow$ \> transform to RTL graph\\
169\textsf{RTLabs}\\
170\> $\downarrow$ \> start of target specific back-end\\
171\>\quad \vdots
172\end{tabbing}
173\end{minipage}
174\end{center}
175\caption{Front-end languages and transformations}
176\label{fig:summary}
177\end{figure}
178
179\subsection{Revisions to the prototype compiler}
180
181Ongoing work to maintain and improve the prototype compiler has resulted in
182several changes, mostly minor.  The most important change is that the
183transformations to replace 16 and 32 bit types have been moved from the
184\textsf{Clight} language to the target-specific stage between \textsf{RTLabs}
185and \textsf{RTL} to help generate better code, and the addition of a
186\textsf{Clight} cast removal transformation to reduce the number of 16 and 32
187bit operations.
188
189The formalized semantics have tracked these changes, and in this report we
190describe them as they currently stand.
191
192\section{Definitions common to several languages}
193
194The semantics for many of the languages in the compiler share some core parts:
195the memory model, environments, runtime values, definitions of operations and
196a monad for encapsulating failure and I/O (using resumptions).  The executable
197memory model was ported from CompCert as part of the work on \textsf{Clight}
198and was reused for the front-end languages\footnote{However, it is likely that
199we will revise the memory model to make it better suited for describing the
200back-end of our compiler.}.  The failure and I/O monad was introduced in the
201previous deliverable on \textsf{Clight}~\cite[\S4.2]{d3.1}.  In all of the
202languages except \textsf{Clight} we have a basic form of type, \lstinline'typ'
203identifying integers and pointers along with their sizes.  The other parts
204are discussed below, with the only change to the runtime values being the
205representation of integers.
206
207\subsection{Identifiers}
208
209Each language features one or more kinds of names to represent variables, such
210as registers, \texttt{goto} labels or RTL graph labels.  We also need to
211describe various maps whose domain is a set of identifiers when defining the
212semantics and compilation.
213
214Previous work on the executable semantics of the target machine
215code included bit vectors and bit vector tries to define various integers in
216the semantics, and give a low level view of memory~\cite{d4.1}.  To keep the
217size of the development down we have reused these data structures for
218identifiers and maps respectively.
219
220One difficulty with using fixed size bit vectors for identifiers is that
221fresh name generation can fail if generate too many.  While we use an error
222monad to deal with failures, we wish to minimize its use in the compiler.
223Thus we add a flag to detect overflows, and check it after the phase of the
224compiler is complete to report exhaustion.  The rest of the phase can then be
225written as if name generation always succeeds.  In practice this will never
226occur on normal programs because more identifiers of each sort are available
227than bytes of code memory on the target.
228
229Given the wide variety of identifiers used in the compiler we also
230wish to separate the different classes of identifier.  Thus we encapsulate the
231bit vector representing the identifier in a datatype that also carries a tag
232identifying which class of identifier we are using:
233\begin{lstlisting}
234    inductive identifier (tag:String) : Type[0] ≝
235      an_identifier : Word $\rightarrow$ identifier tag.
236\end{lstlisting}
237The tries are also tagged in the same manner.  These tags have also proved
238useful during testing by making the resulting terms more readable.
239
240\subsection{Machine integers and arithmetic}
241
242The bit vectors in~\cite{d4.1} also came equipped with some basic arithmetic
243for the target semantics.  The front-end required these operations to be
244generalized and extended.  In particular, we required operations such as zero
245and sign extension and translation between bit vectors and full integers.  It
246also became apparent that while the original definitions worked reasonably on
2478-bit vectors, they did not scale up to 32-bit integers.  The definitions were
248then reworked to make them efficient enough to animate programs in the
249front-end semantics.
250
251\subsection{Front-end operations}
252
253The two front-end intermediate languages, \textsf{Cminor} and \textsf{RTLabs},
254share the same set of operations on values.  They differ from
255\textsf{Clight}'s operations by incorporating casts and by having a separate
256operation for each type of data operated upon.  For example, subtraction of
257pointers is treated as a different operation from subtraction of integers.
258
259A common semantics is given for these operations in the form of simple
260CIC functions on the operation and runtime values.
261
262\section{\textsf{Clight} modifications}
263
264The \textsf{Clight} input language remained largely the same as in the
265previous deliverable~\cite{d3.1}.  The principal changes were to use the
266identifiers and arithmetic described above in place of the arbitrarily large
267integers used before.  For the identifiers, this relieved us of the burden of
268adding an efficient datatype for maps by reusing the bit vector tries instead.
269
270The arithmetic replaced a dependent pair of an arbitrary integer and a proof
271that it was in range of 32 bit integers by the exact bit vector for each
272size of integer.  This direct approach is closer to the implementation and
273more obviously correct --- no extra precision can be left in by accident.
274
275\section{\textsf{Cminor}}
276
277The \textsf{Cminor} language does not store local variables in memory, and has
278simpler control structures than \textsf{Clight}.  It is similar in nature to
279the \textsf{Cminor} language in CompCert, although the semantics have been
280based on the \cerco{} prototype rather than ported from CompCert.  The syntax
281is similar to the prototype, except that the types attached to expressions are
282restricted so that some corner cases are ruled out in the \textsf{Cminor} to
283\textsf{RTLabs} stage (see the accompanying Deliverable 3.2 for details):
284\begin{lstlisting}
285    inductive expr : typ $\rightarrow$ Type[0] ≝
286    | Id : $\forall$t. ident $\rightarrow$ expr t
287    | Cst : $\forall$t. constant $\rightarrow$ expr t
288    | Op1 : $\forall$t,t'. unary_operation $\rightarrow$ expr t $\rightarrow$ expr t'
289    | Op2 : $\forall$t1,t2,t'. binary_operation $\rightarrow$ expr t1 $\rightarrow$ expr t2 $\rightarrow$ expr t'
290    | Mem : $\forall$t,r. memory_chunk $\rightarrow$ expr (ASTptr r) $\rightarrow$ expr t
291    | Cond : $\forall$sz,sg,t. expr (ASTint sz sg) $\rightarrow$ expr t $\rightarrow$ expr t $\rightarrow$ expr t
292    | Ecost : $\forall$t. costlabel $\rightarrow$ expr t $\rightarrow$ expr t.
293\end{lstlisting}
294For example, note that conditional expressions only switch on integer
295expressions.
296In principle we could extend this to statically ensure that only well-typed
297\textsf{Cminor} expressions are considered, and we will consider this as part
298of the work on correctness in Task 3.4.
299
300We also provide a variant of the syntax where the only initialization data is
301the size of each global variable, for use after the initialization code has
302been generated.
303
304The definition of the semantics is routine: a functional definition of a
305single small step of the machine is given, reusing the memory model,
306environments, arithmetic and operations mentioned above.
307
308\section{\textsf{RTLabs}}
309
310The \textsf{RTLabs} language provides a target independent Register Transfer
311Language, where programs are represented as control flow graphs.  We use the
312identifiers described above for the graph labels and the maps for the graph
313itself.  The tagging mechanism ensures that labels cannot be mixed up with
314other identifiers in the program (in particular, we cannot accidentally reuse
315a \texttt{goto} label from Cminor where a graph label should appear).
316
317Otherwise, the syntax and semantics of \textsf{RTLabs} mirrors that of the
318prototype compiler.  Some of the syntax is shown below, including the type of
319the control flow graphs.  The same common elements are used as for \textsf{Cminor},
320including the front-end operations mentioned above.
321
322\begin{lstlisting}[language=matita]
323inductive statement : Type[0] ≝
324| St_skip : label $\rightarrow$ statement
325| St_cost : costlabel $\rightarrow$ label $\rightarrow$ statement
326| St_const : register $\rightarrow$ constant $\rightarrow$ label $\rightarrow$ statement
327| St_op1 : unary_operation $\rightarrow$ register $\rightarrow$ register $\rightarrow$ label $\rightarrow$ statement
328...
329| St_return : statement
330.
331
332definition graph : Type[0] $\rightarrow$ Type[0] ≝ identifier_map LabelTag.
333
334record internal_function : Type[0] ≝
335{ f_labgen    : universe LabelTag
336; f_reggen    : universe RegisterTag
337; f_result    : option (register $\times$ typ)
338; f_params    : list (register $\times$ typ)
339; f_locals    : list (register $\times$ typ)
340; f_stacksize : nat
341; f_graph     : graph statement
342}.
343\end{lstlisting}
344
345\section{Testing}
346
347To provide some assurance that the semantics were properly implemented, and to
348support the testing described in the accompanying Deliverable 3.2, we have
349adapted the pretty printers in the prototype compiler to produce \matita{}
350terms for the syntax of each language described above.
351
352A few common definitions were added for animating the small step semantics
353definitions for any of the front-end languages in \matita.  We then used a
354small selection of test cases to ensure basic functionality.  However, this is
355still a time consuming process, so more testing will carried out once the
356extraction of CIC terms to \ocaml{} programs is implemented in \matita.
357
358\section{Embedding invariants}
359
360Each phase of the prototype compiler can fail in a number of places if the
361input language permits programs that are badly structured in some sense: a
362missing label in a \texttt{goto} statement or CFG, an undefined variable name,
363a \texttt{break} statement outside of a loop or \texttt{switch}, and so on.
364We wish to restrict our intermediate languages using dependent types to remove
365as many `junk' programs as possible to rule out such failures.  We also hope
366that such restrictions will help in other correctness proofs.
367
368This goal lies in the overlap between several tasks in the project: it
369involves manipulating the syntax and semantics of the intermediate languages
370(the present work), the encoding of the front-end compiler phases in \matita{}
371(Task 3.2) and the correctness of the front-end (Task 3.4).  Thus this work is
372rather experimental; it is being carried out on branches in our source code
373repository and the final form will be decided and merged in during Task 3.4.
374
375So far we have tried adding two forms of invariant --- one using dependent
376types to index statements in \textsf{Cminor} by their block depth, and the
377other asserts that variables and labels are present in the appropriate
378environments by adding a separate invariant to each function.  Note that these
379do not yet cover all of the properties that a program in these languages is
380expected to enjoy; for example, there are currently no checks that references
381to globals are well-defined.
382
383\subsection{\textsf{Cminor} block depth}
384
385The \textsf{Cminor} language has relatively simple control structures.
386Statements are provided for infinite loops, non-looping blocks and
387exiting an arbitrary number of blocks (for \texttt{break}, failing loop guards
388and the switch statement\footnote{We are considering replacing the
389\textsf{Cminor} switch statement with one that uses \texttt{goto}-like labels
390in both the prototype and the formalized compilers, but for now we stick with
391this CompCert-style arrangement.}).
392
393However, this means that there are badly-formed \textsf{Cminor} programs such
394as
395\begin{lstlisting}[language={}]
396    int main() {
397      block {
398        loop {
399          exit 5
400        }
401      }
402    }
403\end{lstlisting}
404where we attempt to exit more blocks than exist.  To rule these out (including
405demonstrating that the previous phase of the compiler does not generate them)
406we can index the statements of the language by the depth of the enclosing
407blocks.
408
409The adaption of the syntax adds the depth to every statement, and uses bounded
410integers in the exit and switch statements:
411\begin{lstlisting}[language=matita]
412    inductive stmt : $\forall$blockdepth:nat. Type[0] ≝
413    | St_skip : $\forall$n. stmt n
414    ...
415    | St_loop : $\forall$n. stmt n $\rightarrow$ stmt n
416    | St_block : $\forall$n. stmt (S n) $\rightarrow$ stmt n
417    | St_exit : $\forall$n. Fin n $\rightarrow$ stmt n
418    (* expr to switch on, table of <switch value, #blocks to exit>, default *)
419    | St_switch : expr $\rightarrow$ $\forall$n. list (int $\times$ (Fin n)) $\rightarrow$ Fin n $\rightarrow$ stmt n
420    ...
421\end{lstlisting}
422where \texttt{stmt n} is a statement enclosed in \texttt{n} blocks,
423and \texttt{Fin n} is a standard construction for a natural number which is at
424most \texttt{n}.
425
426In the semantics the number of blocks is also added to the continuations and
427state, and the function to find the continuation from an exit statement can be
428made failure-free.
429We note in passing that adding this parameter detected a small mistake in the
430semantics concerning continuations and tail calls, although the mistake itself
431was benign.
432
433\subsection{Identifier invariants}
434
435To show that the variables and labels occurring in the body of a function
436are present in the relevant structures we add an additional invariant to the
437function records.
438
439For \textsf{Cminor} we use a higher-order predicate which recursively applies
440a predicate to all substatements:
441\begin{lstlisting}[language=matita]
442let rec stmt_P (P:stmt $\rightarrow$ Prop) (s:stmt) on s : Prop ≝
443match s with
444[ St_seq s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
445| St_ifthenelse _ _ _ s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
446| St_loop s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
447| St_block s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
448| St_label _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
449| St_cost _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
450| _ $\Rightarrow$ P s
451].
452\end{lstlisting}
453Dependent pattern matching on statements thus allows an accompanying
454\lstinline'stmt_P' fact to be unfold to the predicate on the current statement
455and the predicate applied to all substatements.
456
457We require two properties to hold in \textsf{Cminor} functions:
458\begin{enumerate}
459\item All variables in the body are present in the list of parameters or the
460list of variables for the function (this also uses a similar recursive
461predicate on expressions).
462\item All labels in \texttt{goto} statements appear in a label statement.
463\end{enumerate}
464The function definition thus becomes:
465\begin{lstlisting}[language=matita]
466record internal_function : Type[0] ≝
467{ f_return    : option typ
468; f_params    : list (ident $\times$ typ)
469; f_vars      : list (ident $\times$ typ)
470; f_stacksize : nat
471; f_body      : stmt
472; f_inv       : stmt_P ($\lambda$s.stmt_vars ($\lambda$i.Exists ? ($\lambda$x.\fst x = i) (f_params @ f_vars)) s $\wedge$
473                           stmt_labels ($\lambda$l.Exists ? ($\lambda$l'.l' = l) (labels_of f_body)) s) f_body
474}.
475\end{lstlisting}
476where \lstinline'stmt_vars' and \lstinline'stmt_labels' constrain the
477variables and labels that appear directly in a statement (but not
478substatements) to appear in the given list, and
479\lstinline'labels_of' returns a list of all the labels defined in a
480statement.
481
482The \textsf{Clight} semantics can be amended to use these invariants, although
483the main benefit is for the compiler stages (see the accompanying Deliverable
4843.2 for details).  The semantics require the invariants to be added to the
485state and continuations.  It was convenient to split the continuations between
486the local continuation representing the rest of the code to be executed within
487the function, and the stack of function calls because it becomes easier to
488state the property on the local continuation alone.
489  The invariant for variables is
490slightly different --- we require that every variable appear in the local
491environment.  We use \lstinline'f_inv' from the function to establish this
492invariant when the environment is set up on function entry.
493
494It is unclear whether changing the semantics is really worthwhile.  It
495witnesses that the invariants are those we wanted, but makes no difference to
496the actual execution of the program, especially as the execution can still
497fail due to genuine runtime errors.  Moreover, it is unclear what effect the
498presence of proof terms and more dependent pattern matching in the semantics
499will have on the complexity of future correctness proofs.  We plan to examine
500this issue during Task 3.4.
501
502We use a similar method to specify the invariant that the \textsf{RTLabs}
503graph is closed --- that is, any successor labels in a statement in the graph
504are present in the graph.  The definition is simpler in \textsf{RTLabs}
505because the flat representation of the graph does not require recursive
506definitions like \lstinline'stmt_P' above.
507
508\section{Conclusion}
509
510We have developed executable semantics for each of the front-end languages of
511the \cerco{} compiler.  These will form the basis of the correctness
512statements for each stage of the compiler in Task 3.4.  We have also shown
513that useful invariants can be added as dependent types, and intend to use
514these in subsequent work.
515
516\bibliographystyle{plain}
517\bibliography{report}
518
519\end{document}
Note: See TracBrowser for help on using the repository browser.