source: Deliverables/D3.3/Report/report.tex

Last change on this file was 1370, checked in by campbell, 9 years ago

D3.3: Added a subsection on the SmallstepExec?.ma definitions, plus a few related
comments.

File size: 24.2 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.3\\
70Executable Formal Semantics of front-end intermediate languages\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We report on the formalization of the front-end intermediate languages for the
102\cerco{} project's compiler using executable semantics in the \matita{} proof
103assistant.  This includes common definitions for fresh identifier handling,
104$n$-bit arithmetic and operations and testing of the semantics.  We also
105consider embedding invariants into the semantics for showing correctness
106properties.
107
108This work was Task~3.3 of the \cerco{} project and the languages that were
109formalized were first described in Deliverable~2.1~\cite{d2.1}.  They are used
110by the formalized front-end in Task~3.2 to provide the syntax for the
111transformations, and to test them by animating programs in the executable
112semantics.  It will also be a crucial part of the specifications for the
113correctness results in Task~3.4.
114
115\newpage 
116
117\tableofcontents
118
119% CHECK: clear up any -ize vs -ise
120% CHECK: clear up any "front end" vs "front-end"
121% CHECK: clear up any mentions of languages that aren't textsf'd.
122% CHECK: fix unicode in listings
123
124\section{Introduction}
125
126This work formalizes the front-end intermediate languages from the \cerco{}
127prototype compiler, described in previous deliverables~\cite{d2.1,d2.2}.  The
128front-end of the compiler is summarized in Figure~\ref{fig:summary} including
129the intermediate languages and the compiler passes described in the
130accompanying Deliverable 3.2.  We have also refined parts of the formal
131development that are used for several of the languages in the compiler.
132
133The input language to the formalized front-end is the \textsf{Clight}
134language.  The executable semantics for this language were presented in a
135previous deliverable~\cite{d3.1}.  Here we will report on some
136minor changes to its semantics made to better align it with the whole
137development.
138
139The formalization of each language takes the form of definitions for abstract
140syntax and functions providing a small-step executable semantics.  This is
141done in the Calculus of Inductive Constructions (CIC), as implemented in the
142\matita{} proof assistant.  These definitions will be essential for the
143correctness proofs of the compiler in Task 3.4.
144
145Finally, we will report on work to add several invariants to the
146languages.  This activity overlaps with Task 3.4 on the correctness of the
147compiler front-end.  However, the use of dependent types mean that this work
148is tied closely to the definition of the languages and the transformations of
149the front-end in Task 3.2.  By considering it now we can experiment with and
150judge its impact on the formal semantics, and how feasible retrofitting such
151invariants is.
152
153\begin{figure}
154\begin{center}
155\begin{minipage}{.8\linewidth}
156\begin{tabbing}
157\quad \= $\downarrow$ \quad \= \kill
158\textsf{C} (unformalized)\\
159\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
160\textsf{Clight}\\
161\> $\downarrow$ \> cast removal\\
162\> $\downarrow$ \> add runtime functions\\
163\> $\downarrow$ \> labelling\\
164\> $\downarrow$ \> stack variable allocation and control structure
165 simplification\\
166\textsf{Cminor}\\
167\> $\downarrow$ \> generate global variable initialization code\\
168\> $\downarrow$ \> transform to RTL graph\\
169\textsf{RTLabs}\\
170\> $\downarrow$ \> start of target specific back-end\\
171\>\quad \vdots
172\end{tabbing}
173\end{minipage}
174\end{center}
175\caption{Front-end languages and transformations}
176\label{fig:summary}
177\end{figure}
178
179\subsection{Revisions to the prototype compiler}
180
181Ongoing work to maintain and improve the prototype compiler has resulted in
182several changes, mostly minor.  The most important change is that the
183transformations to replace 16 and 32 bit types have been moved from the
184\textsf{Clight} language to the target-specific stage between \textsf{RTLabs}
185and \textsf{RTL} to help generate better code, and the addition of a
186\textsf{Clight} cast removal transformation to reduce the number of 16 and 32
187bit operations.
188
189The formalized semantics have tracked these changes, and in this report we
190describe them as they currently stand.
191
192\section{Definitions common to several languages}
193
194The semantics for many of the languages in the compiler share some core parts:
195the memory model, environments, runtime values, definitions of operations,
196a monad for encapsulating failure and I/O (using resumptions), and a common
197abstraction for small-step executable semantics.  The executable
198memory model was ported from CompCert as part of the work on \textsf{Clight}
199and was reused for the front-end languages\footnote{However, it is likely that
200we will revise the memory model to make it better suited for describing
201all of our compiler, not just the front-end.}.  The failure and I/O monad was introduced in the
202previous deliverable on \textsf{Clight}~\cite[\S4.2]{d3.1}.  In all of the
203languages except \textsf{Clight} we have a basic form of type, \lstinline'typ'
204identifying integers and pointers along with their sizes.  The other parts
205are discussed below, with the only change to the runtime values being the
206representation of integers.
207
208\subsection{Identifiers}
209
210Each language features one or more kinds of names to represent variables, such
211as registers, \texttt{goto} labels or RTL graph labels.  We also need to
212describe various maps whose domain is a set of identifiers when defining the
213semantics and compilation.
214
215Previous work on the executable semantics of the target machine
216code included bit vectors and bit vector tries to define various integers in
217the semantics, and give a low level view of memory~\cite{d4.1}.  To keep the
218size of the development down we have reused these data structures for
219identifiers and maps respectively.
220
221One difficulty with using fixed size bit vectors for identifiers is that
222fresh name generation can fail if generate too many.  While we use an error
223monad to deal with failures, we wish to minimize its use in the compiler.
224Thus we add a flag to detect overflows, and check it after the phase of the
225compiler is complete to report exhaustion.  The rest of the phase can then be
226written as if name generation always succeeds.  In practice this will never
227occur on normal programs because more identifiers of each sort are available
228than bytes of code memory on the target.
229
230Given the wide variety of identifiers used in the compiler we also
231wish to separate the different classes of identifier.  Thus we encapsulate the
232bit vector representing the identifier in a datatype that also carries a tag
233identifying which class of identifier we are using:
234\begin{lstlisting}
235    inductive identifier (tag:String) : Type[0] ≝
236      an_identifier : Word $\rightarrow$ identifier tag.
237\end{lstlisting}
238The tries are also tagged in the same manner.  These tags have also proved
239useful during testing by making the resulting terms more readable.
240
241\subsection{Machine integers and arithmetic}
242
243The bit vectors in~\cite{d4.1} also came equipped with some basic arithmetic
244for the target semantics.  The front-end required these operations to be
245generalized and extended.  In particular, we required operations such as zero
246and sign extension and translation between bit vectors and full integers.  It
247also became apparent that while the original definitions worked reasonably on
2488-bit vectors, they did not scale up to 32-bit integers.  The definitions were
249then reworked to make them efficient enough to animate programs in the
250front-end semantics.
251
252\subsection{Front-end operations}
253
254The two front-end intermediate languages, \textsf{Cminor} and \textsf{RTLabs},
255share the same set of operations on values.  They differ from
256\textsf{Clight}'s operations by incorporating casts and by having a separate
257operation for each type of data operated upon.  For example, subtraction of
258pointers is treated as a different operation from subtraction of integers.
259
260A common semantics is given for these operations in the form of simple
261CIC functions on the operation and runtime values.
262
263\subsection{Presentation of small-step executable semantics}
264\label{sec:smallstepexec}
265
266Each language's semantics is described by an instantiation of two records for
267defining transition systems.  We already use these to animate the semantics of
268the languages for testing, but they will also be used to state the simulation
269properties we will prove in Tasks~3.4 and~4.4.
270
271First we describe suitable transition systems,
272\begin{lstlisting}
273  record trans_system (outty:Type[0]) (inty:outty $\rightarrow$ Type[0]) : Type[2] :=
274  { global : Type[1]
275  ; state  : global $\rightarrow$ Type[0]
276  ; is_final : $\forall$g. state g $\rightarrow$ option int
277  ; step : $\forall$g. state g $\rightarrow$ IO outty inty (trace$\times$(state g))
278  }.
279\end{lstlisting}
280where we have some type of \lstinline'global' data that remains fixed
281throughout evaluation (typically used for the global environment), a type for
282states, a function to detect a final successful state and a step function
283which can also return an error or request for I/O.  The type of states may
284depend on the fixed data so that we will be able to add invariants asserting
285that (for example) all the global variables referenced by the program state
286exist.
287
288We also define a coinductive description of executions and a cofixpoint to
289produce them.  The second record extends the transition system with
290a type of programs and functions to initialise the transition system:
291\begin{lstlisting}
292  record fullexec (outty:Type[0]) (inty:outty $\rightarrow$ Type[0]) : Type[2] :=
293  { program : Type[0]
294  ; es1 :> trans_system outty inty
295  ; make_global : program $\rightarrow$ global ?? es1
296  ; make_initial_state : $\forall$p:program. res (state ?? es1 (make_global p))
297  }.
298\end{lstlisting}
299Finally another function is given which uses them to produce a full execution
300starting from the program alone.
301
302\section{\textsf{Clight} modifications}
303
304The \textsf{Clight} input language remained largely the same as in the
305previous deliverable~\cite{d3.1}.  The principal changes were to use the
306identifiers and arithmetic described above in place of the arbitrarily large
307integers used before.  For the identifiers, this relieved us of the burden of
308adding an efficient datatype for maps by reusing the bit vector tries instead.
309
310The arithmetic replaced a dependent pair of an arbitrary integer and a proof
311that it was in range of 32 bit integers by the exact bit vector for each
312size of integer.  This direct approach is closer to the implementation and
313more obviously correct --- no extra precision can be left in by accident.
314
315\section{\textsf{Cminor}}
316
317The \textsf{Cminor} language does not store local variables in memory, and has
318simpler control structures than \textsf{Clight}.  It is similar in nature to
319the \textsf{Cminor} language in CompCert, although the semantics have been
320based on the \cerco{} prototype rather than ported from CompCert.  The syntax
321is similar to the prototype, except that the types attached to expressions are
322restricted so that some corner cases are ruled out in the \textsf{Cminor} to
323\textsf{RTLabs} stage (see the accompanying Deliverable 3.2 for details):
324\begin{lstlisting}
325    inductive expr : typ $\rightarrow$ Type[0] ≝
326    | Id : $\forall$t. ident $\rightarrow$ expr t
327    | Cst : $\forall$t. constant $\rightarrow$ expr t
328    | Op1 : $\forall$t,t'. unary_operation $\rightarrow$ expr t $\rightarrow$ expr t'
329    | Op2 : $\forall$t1,t2,t'. binary_operation $\rightarrow$ expr t1 $\rightarrow$ expr t2 $\rightarrow$ expr t'
330    | Mem : $\forall$t,r. memory_chunk $\rightarrow$ expr (ASTptr r) $\rightarrow$ expr t
331    | Cond : $\forall$sz,sg,t. expr (ASTint sz sg) $\rightarrow$ expr t $\rightarrow$ expr t $\rightarrow$ expr t
332    | Ecost : $\forall$t. costlabel $\rightarrow$ expr t $\rightarrow$ expr t.
333\end{lstlisting}
334For example, note that conditional expressions only switch on integer
335expressions.
336In principle we could extend this to statically ensure that only well-typed
337\textsf{Cminor} expressions are considered, and we will consider this as part
338of the work on correctness in Task 3.4.
339
340We also provide a variant of the syntax where the only initialization data is
341the size of each global variable, for use after the initialization code has
342been generated.
343
344The definition of the semantics is routine: a functional definition of a
345single small-step of the machine is given, reusing the memory model,
346environments, arithmetic and operations mentioned above.
347
348\section{\textsf{RTLabs}}
349
350The \textsf{RTLabs} language provides a target independent Register Transfer
351Language, where programs are represented as control flow graphs.  We use the
352identifiers described above for the graph labels and the maps for the graph
353itself.  The tagging mechanism ensures that labels cannot be mixed up with
354other identifiers in the program (in particular, we cannot accidentally reuse
355a \texttt{goto} label from Cminor where a graph label should appear).
356
357Otherwise, the syntax and semantics of \textsf{RTLabs} mirrors that of the
358prototype compiler.  Some of the syntax is shown below, including the type of
359the control flow graphs.  The same common elements are used as for \textsf{Cminor},
360including the front-end operations mentioned above.
361
362\begin{lstlisting}[language=matita]
363inductive statement : Type[0] ≝
364| St_skip : label $\rightarrow$ statement
365| St_cost : costlabel $\rightarrow$ label $\rightarrow$ statement
366| St_const : register $\rightarrow$ constant $\rightarrow$ label $\rightarrow$ statement
367| St_op1 : unary_operation $\rightarrow$ register $\rightarrow$ register $\rightarrow$ label $\rightarrow$ statement
368...
369| St_return : statement
370.
371
372definition graph : Type[0] $\rightarrow$ Type[0] ≝ identifier_map LabelTag.
373
374record internal_function : Type[0] ≝
375{ f_labgen    : universe LabelTag
376; f_reggen    : universe RegisterTag
377; f_result    : option (register $\times$ typ)
378; f_params    : list (register $\times$ typ)
379; f_locals    : list (register $\times$ typ)
380; f_stacksize : nat
381; f_graph     : graph statement
382}.
383\end{lstlisting}
384
385\section{Testing}
386
387To provide some assurance that the semantics were properly implemented, and to
388support the testing described in the accompanying Deliverable 3.2, we have
389adapted the pretty printers in the prototype compiler to produce \matita{}
390terms for the syntax of each language described above.
391
392A few common definitions were added for animating the small-step semantics
393of any of the front-end languages in \matita{}, given a bound on
394the number of steps to execute and any input required.  These use the
395definitions described in Section~\ref{sec:smallstepexec} to perform the actual
396execution.  We then used a small selection of test cases to ensure basic
397functionality.  However, this is still a time consuming process, so more
398testing will carried out once the extraction of CIC terms to \ocaml{} programs
399is implemented in \matita.
400
401\section{Embedding invariants}
402
403Each phase of the prototype compiler can fail in a number of places if the
404input language permits programs that are badly structured in some sense: a
405missing label in a \texttt{goto} statement or CFG, an undefined variable name,
406a \texttt{break} statement outside of a loop or \texttt{switch}, and so on.
407We wish to restrict our intermediate languages using dependent types to remove
408as many `junk' programs as possible to rule out such failures.  We also hope
409that such restrictions will help in other correctness proofs.
410
411This goal lies in the overlap between several tasks in the project: it
412involves manipulating the syntax and semantics of the intermediate languages
413(the present work), the encoding of the front-end compiler phases in \matita{}
414(Task 3.2) and the correctness of the front-end (Task 3.4).  Thus this work is
415rather experimental; it is being carried out on branches in our source code
416repository and the final form will be decided and merged in during Task 3.4.
417
418So far we have tried adding two forms of invariant --- one using dependent
419types to index statements in \textsf{Cminor} by their block depth, and the
420other asserts that variables and labels are present in the appropriate
421environments by adding a separate invariant to each function.  Note that these
422do not yet cover all of the properties that a program in these languages is
423expected to enjoy; for example, there are currently no checks that references
424to globals are well-defined.
425
426\subsection{\textsf{Cminor} block depth}
427
428The \textsf{Cminor} language has relatively simple control structures.
429Statements are provided for infinite loops, non-looping blocks and
430exiting an arbitrary number of blocks (for \texttt{break}, failing loop guards
431and the switch statement\footnote{We are considering replacing the
432\textsf{Cminor} switch statement with one that uses \texttt{goto}-like labels
433in both the prototype and the formalized compilers, but for now we stick with
434this CompCert-style arrangement.}).
435
436However, this means that there are badly-formed \textsf{Cminor} programs such
437as
438\begin{lstlisting}[language={}]
439    int main() {
440      block {
441        loop {
442          exit 5
443        }
444      }
445    }
446\end{lstlisting}
447where we attempt to exit more blocks than exist.  To rule these out (including
448demonstrating that the previous phase of the compiler does not generate them)
449we can index the statements of the language by the depth of the enclosing
450blocks.
451
452The adaption of the syntax adds the depth to every statement, and uses bounded
453integers in the exit and switch statements:
454\begin{lstlisting}[language=matita]
455    inductive stmt : $\forall$blockdepth:nat. Type[0] ≝
456    | St_skip : $\forall$n. stmt n
457    ...
458    | St_loop : $\forall$n. stmt n $\rightarrow$ stmt n
459    | St_block : $\forall$n. stmt (S n) $\rightarrow$ stmt n
460    | St_exit : $\forall$n. Fin n $\rightarrow$ stmt n
461    (* expr to switch on, table of <switch value, #blocks to exit>, default *)
462    | St_switch : expr $\rightarrow$ $\forall$n. list (int $\times$ (Fin n)) $\rightarrow$ Fin n $\rightarrow$ stmt n
463    ...
464\end{lstlisting}
465where \texttt{stmt n} is a statement enclosed in \texttt{n} blocks,
466and \texttt{Fin n} is a standard construction for a natural number which is at
467most \texttt{n}.
468
469In the semantics the number of blocks is also added to the continuations and
470state, and the function to find the continuation from an exit statement can be
471made failure-free.
472We note in passing that adding this parameter detected a small mistake in the
473semantics concerning continuations and tail calls, although the mistake itself
474was benign.
475
476\subsection{Identifier invariants}
477
478To show that the variables and labels occurring in the body of a function
479are present in the relevant structures we add an additional invariant to the
480function records.
481
482For \textsf{Cminor} we use a higher-order predicate which recursively applies
483a predicate to all substatements:
484\begin{lstlisting}[language=matita]
485let rec stmt_P (P:stmt $\rightarrow$ Prop) (s:stmt) on s : Prop ≝
486match s with
487[ St_seq s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
488| St_ifthenelse _ _ _ s1 s2 $\Rightarrow$ P s $\wedge$ stmt_P P s1 $\wedge$ stmt_P P s2
489| St_loop s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
490| St_block s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
491| St_label _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
492| St_cost _ s' $\Rightarrow$ P s $\wedge$ stmt_P P s'
493| _ $\Rightarrow$ P s
494].
495\end{lstlisting}
496Dependent pattern matching on statements thus allows an accompanying
497\lstinline'stmt_P' fact to be unfold to the predicate on the current statement
498and the predicate applied to all substatements.
499
500We require two properties to hold in \textsf{Cminor} functions:
501\begin{enumerate}
502\item All variables in the body are present in the list of parameters or the
503list of variables for the function (this also uses a similar recursive
504predicate on expressions).
505\item All labels in \texttt{goto} statements appear in a label statement.
506\end{enumerate}
507The function definition thus becomes:
508\begin{lstlisting}[language=matita]
509record internal_function : Type[0] ≝
510{ f_return    : option typ
511; f_params    : list (ident $\times$ typ)
512; f_vars      : list (ident $\times$ typ)
513; f_stacksize : nat
514; f_body      : stmt
515; f_inv       : stmt_P ($\lambda$s.stmt_vars ($\lambda$i.Exists ? ($\lambda$x.\fst x = i) (f_params @ f_vars)) s $\wedge$
516                           stmt_labels ($\lambda$l.Exists ? ($\lambda$l'.l' = l) (labels_of f_body)) s) f_body
517}.
518\end{lstlisting}
519where \lstinline'stmt_vars' and \lstinline'stmt_labels' constrain the
520variables and labels that appear directly in a statement (but not
521substatements) to appear in the given list, and
522\lstinline'labels_of' returns a list of all the labels defined in a
523statement.
524
525The \textsf{Clight} semantics can be amended to use these invariants, although
526the main benefit is for the compiler stages (see the accompanying Deliverable
5273.2 for details).  The semantics require the invariants to be added to the
528state and continuations.  It was convenient to split the continuations between
529the local continuation representing the rest of the code to be executed within
530the function, and the stack of function calls because it becomes easier to
531state the property on the local continuation alone.
532  The invariant for variables is
533slightly different --- we require that every variable appear in the local
534environment.  We use \lstinline'f_inv' from the function to establish this
535invariant when the environment is set up on function entry.
536
537It is unclear whether changing the semantics is really worthwhile.  It
538witnesses that the invariants are those we wanted, but makes no difference to
539the actual execution of the program, especially as the execution can still
540fail due to genuine runtime errors.  Moreover, it is unclear what effect the
541presence of proof terms and more dependent pattern matching in the semantics
542will have on the complexity of future correctness proofs.  We plan to examine
543this issue during Task 3.4.
544
545We use a similar method to specify the invariant that the \textsf{RTLabs}
546graph is closed --- that is, any successor labels in a statement in the graph
547are present in the graph.  The definition is simpler in \textsf{RTLabs}
548because the flat representation of the graph does not require recursive
549definitions like \lstinline'stmt_P' above.
550
551\section{Conclusion}
552
553We have developed executable semantics for each of the front-end languages of
554the \cerco{} compiler.  These will form the basis of the correctness
555statements for each stage of the compiler in Task 3.4.  We have also shown
556that useful invariants can be added as dependent types, and intend to use
557these in subsequent work.
558
559\bibliographystyle{plain}
560\bibliography{report}
561
562\end{document}
Note: See TracBrowser for help on using the repository browser.