source: Deliverables/D3.3/Report/report.tex @ 1265

Last change on this file since 1265 was 1265, checked in by campbell, 9 years ago

Revisions to D3.3.

File size: 19.6 KB
Line 
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[mathletters]{ucs}
3\usepackage[utf8x]{inputenc}
4\usepackage{listings}
5\usepackage{../../style/cerco}
6\newcommand{\ocaml}{OCaml}
7\newcommand{\clight}{Clight}
8\newcommand{\matita}{Matita}
9\newcommand{\sdcc}{\texttt{sdcc}}
10
11\newcommand{\textSigma}{\ensuremath{\Sigma}}
12
13% LaTeX Companion, p 74
14\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
15
16\lstdefinelanguage{coq}
17  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
18   morekeywords={[2]if,then,else},
19  }
20
21\lstdefinelanguage{matita}
22  {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
23   morekeywords={[2]whd,normalize,elim,cases,destruct},
24   mathescape=true,
25   morecomment=[n]{(*}{*)},
26  }
27
28\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
29        keywordstyle=\color{red}\bfseries,
30        keywordstyle=[2]\color{blue},
31        commentstyle=\color{green},
32        stringstyle=\color{blue},
33        showspaces=false,showstringspaces=false}
34
35\lstset{extendedchars=false}
36\lstset{inputencoding=utf8x}
37\DeclareUnicodeCharacter{8797}{:=}
38\DeclareUnicodeCharacter{10746}{++}
39\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
40\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
41
42
43\title{
44INFORMATION AND COMMUNICATION TECHNOLOGIES\\
45(ICT)\\
46PROGRAMME\\
47\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
48
49\date{ }
50\author{}
51
52\begin{document}
53\thispagestyle{empty}
54
55\vspace*{-1cm}
56\begin{center}
57\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
58\end{center}
59
60\begin{minipage}{\textwidth}
61\maketitle
62\end{minipage}
63
64
65\vspace*{0.5cm}
66\begin{center}
67\begin{LARGE}
68\bf
69Report n. D3.3\\
70Executable Formal Semantics of front end intermediate languages\\
71\end{LARGE} 
72\end{center}
73
74\vspace*{2cm}
75\begin{center}
76\begin{large}
77Version 1.0
78\end{large}
79\end{center}
80
81\vspace*{0.5cm}
82\begin{center}
83\begin{large}
84Main Authors:\\
85Brian~Campbell
86\end{large}
87\end{center}
88
89\vspace*{\fill}
90\noindent
91Project Acronym: \cerco{}\\
92Project full title: Certified Complexity\\
93Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
94
95\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
96
97\newpage
98
99\vspace*{7cm}
100\paragraph{Abstract}
101We report on the formalization of the front end intermediate languages for the
102\cerco{} project's compiler using executable semantics in the \matita{} proof
103assistant.  This includes common definitions for fresh identifier handling,
104$n$-bit arithmetic and operations and testing of the smeantics.  We also
105consider embedding invariants into the semantics for showing correctness
106properties.
107
108\newpage 
109
110\tableofcontents
111
112% TODO: clear up any -ize vs -ise
113% TODO: clear up any "front end" vs "front-end"
114% TODO: clear up any mentions of languages that aren't textsf'd.
115% TODO: mention the restricted form of types after Clight
116% TODO: fix unicode in listings
117% TODO: say something about what CIC is, and why we're doing this.
118
119\section{Introduction}
120
121This work formalizes the front end intermediate languages from the \cerco{}
122prototype compiler, described in previous deliverables~\cite{d2.1,d2.2}.  The
123front end of the compiler is summarized in Figure~\ref{fig:summary} including
124the intermediate languages and the compiler passes described in the
125accompanying deliverable 3.2.  We have also refined parts of the formal
126development that are used for several of the languages in the compiler.
127
128The input language to the formalized front end is the \textsf{Clight}
129language.  The executable semantics for this language were presented in a
130previous deliverable~\cite{d3.1}.  Here we will report on some
131minor changes to its semantics made to better align it with the whole
132development.
133
134Finally, we will report on work to add several invariants to the
135languages.  This activity overlaps with task 3.4 on the correctness of the
136compiler front end.  However, the use of dependent types mean that this work
137is tied closely to the definition of the languages and the transformations of
138the front end in task 3.2.  By considering it now we can experiment with and
139judge its impact on the formal semantics, and how feasible retrofitting such
140invariants is.
141
142\begin{figure}
143\begin{center}
144\begin{minipage}{.8\linewidth}
145\begin{tabbing}
146\quad \= $\downarrow$ \quad \= \kill
147\textsf{C} (unformalized)\\
148\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
149\textsf{Clight}\\
150\> $\downarrow$ \> cast removal\\
151\> $\downarrow$ \> add runtime functions\\
152\> $\downarrow$ \> labelling\\
153\> $\downarrow$ \> stack variable allocation and control structure
154 simplification\\
155\textsf{Cminor}\\
156\> $\downarrow$ \> generate global variable initialisation code\\
157\> $\downarrow$ \> transform to RTL graph\\
158\textsf{RTLabs}\\
159\> $\downarrow$ \> start of target specific backend\\
160\>\quad \vdots
161\end{tabbing}
162\end{minipage}
163\end{center}
164\caption{Front end languages and transformations}
165\label{fig:summary}
166\end{figure}
167
168\subsection{Revisions to the prototype compiler}
169
170Ongoing work to maintain and improve the prototype compiler has resulted in
171several changes, mostly minor.  The most important change is that the
172transformations to replace 16 and 32 bit types have been moved from the
173\textsf{Clight} language to the target-specific stage between \textsf{RTLabs}
174and \textsf{RTL} to help generate better code, and the addition of a
175\textsf{Clight} cast removal transformation to reduce the number of 16 and 32
176bit operations.
177
178The formalized semantics have tracked these changes, and in this report we
179describe them as they currently stand.
180
181\section{Definitions common to several languages}
182
183The semantics for many of the languages in the compiler share some core parts:
184the memory model, environments, runtime values, definitions of operations and
185a monad for encapsulating failure and I/O (using resumptions).  The executable
186memory model was ported from CompCert as part of the work on \textsf{Clight}
187and was reused for the front end languages\footnote{However, it is likely that
188we will revise the memory model to make it better suited for describing the
189back-end of our compiler.}.  The failure and I/O monad was introduced in the
190previous deliverable on \textsf{Clight}~\cite[\S4.2]{d3.1}.  In all of the
191languages except \textsf{Clight} we have a basic form of type, \lstinline'typ'
192identifying integers and pointers along with their sizes.  The other parts
193are discussed below, with the only change to the runtime values being the
194representation of integers.
195
196\subsection{Identifiers}
197
198Each language features one or more kinds of names to represent variables, such
199as registers, \texttt{goto} labels or RTL graph labels.  We also need to
200describe various maps whose domain is a set of identifiers when defining the
201semantics and compilation.
202
203Previous work on the executable semantics of the target machine
204code included bit vectors and bit vector tries to define various integers in
205the semantics, and give a low level view of memory~\cite{d4.1}.  To keep the
206size of the development down we have reused these data structures for
207identifiers and maps respectively.
208
209One difficulty with using fixed size bit vectors for identifiers is that
210fresh name generation can fail if generate too many.  While we use an error
211monad to deal with failures, we wish to minimize its use in the compiler.
212Thus we add a flag to detect overflows, and check it after the phase of the
213compiler is complete to report exhaustion.  The rest of the phase can then be
214written as if name generation always succeeds.  In practice this will never
215occur on normal programs because more identifiers of each sort are available
216than bytes of code memory on the target.
217
218Given the wide variety of identifiers used in the compiler we also
219wish to separate the different classes of identifier.  Thus we encapsulate the
220bit vector representing the identifier in a datatype that also carries a tag
221identifying which class of identifier we are using:
222\begin{lstlisting}
223    inductive identifier (tag:String) : Type[0] ≝
224      an_identifier : Word → identifier tag.
225\end{lstlisting}
226The tries are also tagged in the same manner.  These tags have also proved
227useful during testing by making the resulting terms more readable.
228
229\subsection{Machine integers and arithmetic}
230
231The bit vectors in~\cite{d4.1} also came equipped with some basic arithmetic
232for the target semantics.  The front end required these operations to be
233generalized and extended.
234
235% TODO: expand, note need to sufficiently good performance for testing
236
237\subsection{Front end operations}
238
239The two front end intermediate languages, \textsf{Cminor} and \textsf{RTLabs},
240share the same set of operations on values.  They differ from
241\textsf{Clight}'s operations by incorporating casts and by having a separate
242operation for each type of data operated upon.  For example, subtraction of
243pointers is treated as a different operation from subtraction of integers.
244
245A common semantics is given for these operations in the form of simple
246CIC functions on the operation and runtime values.
247
248\section{\textsf{Clight} modifications}
249
250The \textsf{Clight} input language remained largely the same as in the
251previous deliverable~\cite{d3.1}.  The principal changes were to use the
252identifiers and arithmetic described above in place of the arbitrarily large
253integers used before.  For the identifiers, this relieved us of the burden of
254adding an efficient datatype for maps by reusing the bit vector tries instead.
255
256The arithmetic replaced a dependent pair of an arbitrary integer and a proof
257that it was in range of 32 bit integers by the exact bit vector for each
258size of integer.  This direct approach is closer to the implementation and
259more obviously correct --- no extra precision can be left in by accident.
260
261\section{\textsf{Cminor}}
262
263The \textsf{Cminor} language does not store local variables in memory, and has
264simpler control structures than \textsf{Clight}.  It is similar in nature to
265the \textsf{Cminor} language in CompCert, although the semantics have been
266based on the \cerco{} prototype rather than ported from CompCert.  The syntax
267is similar to the prototype, except that the types attached to expressions are
268restricted so that some corner cases are ruled out in the \textsf{Cminor} to
269\textsf{RTLabs} stage (see the accompanying deliverable 3.2 for details):
270\begin{lstlisting}
271    inductive expr : typ → Type[0] ≝
272    | Id : ∀t. ident → expr t
273    | Cst : ∀t. constant → expr t
274    | Op1 : ∀t,t'. unary_operation → expr t → expr t'
275    | Op2 : ∀t1,t2,t'. binary_operation → expr t1 → expr t2 → expr t'
276    | Mem : ∀t,r. memory_chunk → expr (ASTptr r) → expr t
277    | Cond : ∀sz,sg,t. expr (ASTint sz sg) → expr t → expr t → expr t
278    | Ecost : ∀t. costlabel → expr t → expr t.
279\end{lstlisting}
280For example, note that conditional expressions only switch on integer
281expressions.
282In principle we could extend this to statically ensure that only well-typed
283\textsf{Cminor} expressions are considered, and we will consider this as part
284of the work on correctness in task 3.4.
285
286We also provide a variant of the syntax where the only initialization data is
287the size of each global variable, for use after the initialization code has
288been generated.
289
290The definition of the semantics is routine: a functional definition of a
291single small step of the machine is given, reusing the memory model,
292environments, arithmetic and operations mentioned above.
293
294\section{\textsf{RTLabs}}
295
296The \textsf{RTLabs} language provides a target independent Register Transfer
297Language, where programs are represented as control flow graphs.  We use the
298identifiers described above for the graph labels and the maps for the graph
299itself.  The tagging mechanism ensures that labels cannot be mixed up with
300other identifiers in the program (in particular, we cannot accidentally reuse
301a \texttt{goto} label from Cminor where a graph label should appear).
302
303Otherwise, the syntax and semantics of \textsf{RTLabs} mirrors that of the
304prototype compiler.  The same common elements are used as for \textsf{Cminor},
305including the front end operations mentioned above.
306
307% TODO: give fragment of syntax, in particular the type of the graph
308
309\section{Testing}
310
311To provide some assurance that the semantics were properly implemented, and to
312support the testing described in the accompanying deliverable 3.2, we have
313adapted the pretty printers in the prototype compiler to produce \matita{}
314terms for the syntax of each language described above.
315
316A few common definitions were added for animating the small step semantics
317definitions for any of the front end languages in \matita.  We then used a
318small selection of test cases to ensure basic functionality.  However, this is
319still a time consuming process, so more testing will carried out once the
320extraction of CIC terms to \ocaml{} programs is implemented in \matita.
321
322\section{Embedding invariants}
323
324Each phase of the prototype compiler can fail in a number of places if the
325input language permits programs that are badly structured in some sense: a
326missing label in a \texttt{goto} statement or CFG, an undefined variable name,
327a \texttt{break} statement outside of a loop or \texttt{switch}, and so on.
328We wish to restrict our intermediate languages using dependent types to remove
329as many `junk' programs as possible to rule out such failures.  We also hope
330that such restrictions will help in other correctness proofs.
331
332This goal lies in the overlap between several tasks in the project: it
333involves manipulating the syntax and semantics of the intermediate languages
334(the present work), the encoding of the front-end compiler phases in \matita{}
335(task 3.2) and the correctness of the front-end (task 3.4).  Thus this work is
336rather experimental; it is being carried out on branches in our source code
337repository and the final form will be decided and merged in during task 3.4.
338
339So far we have tried adding two forms of invariant --- one using dependent
340types to index statements in \textsf{Cminor} by their block depth, and the
341other asserts that variables and labels are present in the appropriate
342environments by adding a separate invariant to each function.  Note that these
343do not yet cover all of the properties that a program in these languages is
344expected to enjoy; for example, there are currently no checks that references
345to globals are well-defined.
346
347\subsection{\textsf{Cminor} block depth}
348
349The \textsf{Cminor} language has relatively simple control structures.
350Statements are provided for infinite loops, non-looping blocks and
351exiting an arbitrary number of blocks (for \texttt{break}, failing loop guards
352and the switch statement\footnote{We are considering replacing the
353\textsf{Cminor} switch statement with one that uses \texttt{goto}-like labels
354in both the prototype and the formalized compilers, but for now we stick with
355this CompCert-style arrangement.}).
356
357However, this means that there are badly-formed \textsf{Cminor} programs such
358as
359\begin{lstlisting}[language={}]
360    int main() {
361      block {
362        loop {
363          exit 5
364        }
365      }
366    }
367\end{lstlisting}
368where we attempt to exit more blocks than exist.  To rule these out (including
369demonstrating that the previous phase of the compiler does not generate them)
370we can index the statements of the language by the depth of the enclosing
371blocks.
372
373The adaption of the syntax adds the depth to every statement, and uses bounded
374integers in the exit and switch statements:
375\begin{lstlisting}[language=matita]
376    inductive stmt : ∀blockdepth:nat. Type[0] ≝
377    | St_skip : ∀n. stmt n
378    ...
379    | St_loop : ∀n. stmt n → stmt n
380    | St_block : ∀n. stmt (S n) → stmt n
381    | St_exit : ∀n. Fin n → stmt n
382    (* expr to switch on, table of <switch value, #blocks to exit>, default *)
383    | St_switch : expr → ∀n. list (int × (Fin n)) → Fin n → stmt n
384    ...
385\end{lstlisting}
386where \texttt{stmt n} is a statement enclosed in \texttt{n} blocks,
387and \texttt{Fin n} is a standard construction for a natural number which is at
388most \texttt{n}.
389
390In the semantics the number of blocks is also added to the continuations and
391state, and the function to find the continuation from an exit statement can be
392made failure-free.
393We note in passing that adding this parameter detected a small mistake in the
394semantics concerning continuations and tail calls, although the mistake itself
395was benign.
396
397\subsection{Identifier invariants}
398
399To show that the variables and labels occurring in the body of a function
400are present in the relevant structures we add an additional invariant to the
401function records.
402
403For \textsf{Cminor} we use a higher-order predicate which recursively applies
404a predicate to all substatements:
405\begin{lstlisting}[language=matita]
406let rec stmt_P (P:stmt → Prop) (s:stmt) on s : Prop ≝
407match s with
408[ St_seq s1 s2 ⇒ P s ∧ stmt_P P s1 ∧ stmt_P P s2
409| St_ifthenelse _ _ _ s1 s2 ⇒ P s ∧ stmt_P P s1 ∧ stmt_P P s2
410| St_loop s' ⇒ P s ∧ stmt_P P s'
411| St_block s' ⇒ P s ∧ stmt_P P s'
412| St_label _ s' ⇒ P s ∧ stmt_P P s'
413| St_cost _ s' ⇒ P s ∧ stmt_P P s'
414| _ ⇒ P s
415].
416\end{lstlisting}
417Dependent pattern matching on statements thus allows an accompanying
418\lstinline'stmt_P' fact to be unfold to the predicate on the current statement
419and the predicate applied to all substatements.
420
421We require two properties to hold in \textsf{Cminor} functions:
422\begin{enumerate}
423\item All variables in the body are present in the list of parameters or the
424list of variables for the function (this also uses a similar recursive
425predicate on expressions).
426\item All labels in \texttt{goto} statements appear in a label statement.
427\end{enumerate}
428The function definition thus becomes:
429\begin{lstlisting}[language=matita]
430record internal_function : Type[0] ≝
431{ f_return    : option typ
432; f_params    : list (ident × typ)
433; f_vars      : list (ident × typ)
434; f_stacksize : nat
435; f_body      : stmt
436; f_inv       : stmt_P (λs.stmt_vars (λi.Exists ? (λx.\fst x = i) (f_params @ f_vars)) s ∧
437                           stmt_labels (λl.Exists ? (λl'.l' = l) (labels_of f_body)) s) f_body
438}.
439\end{lstlisting}
440where \lstinline'stmt_vars' and \lstinline'stmt_labels' constrain the
441variables and labels that appear directly in a statement (but not
442substatements) to appear in the given list, and
443\lstinline'labels_of' returns a list of all the labels defined in a
444statement.
445
446The \textsf{Clight} semantics can be amended to use these invariants, although
447the main benefit is for the compiler stages (see the accompanying deliverable
4483.2 for details).  The semantics require the invariants to be added to the
449state and continuations.  It was convenient to split the continuations between
450the local continuation representing the rest of the code to be executed within
451the function, and the stack of function calls.
452% TODO: because???
453  The invariant for variables is
454slightly different --- we require that every variable appear in the local
455environment.  We use \lstinline'f_inv' from the function to establish this
456invariant when the environment is set up on function entry.
457
458It is unclear whether changing the semantics is really worthwhile.  It
459witnesses that the invariants are those we wanted, but makes no difference to
460the actual execution of the program, especially as the execution can still
461fail due to genuine runtime errors.  Moreover, it is unclear what effect the
462presence of proof terms and more dependent pattern matching in the semantics
463will have on the complexity of future correctness proofs.  We plan to examine
464this issue during task 3.4.
465
466We use a similar method to specify the invariant that the \textsf{RTLabs}
467graph is closed --- that is, any successor labels in a statement in the graph
468are present in the graph.  The definition is simpler in \textsf{RTLabs}
469because the flat representation of the graph does not require recursive
470definitions like \lstinline'stmt_P' above.
471
472\section{Conclusion}
473
474\bibliographystyle{plain}
475\bibliography{report}
476
477\end{document}
Note: See TracBrowser for help on using the repository browser.