1 | \documentclass[11pt,epsf,a4wide]{article} |
---|
2 | \usepackage[mathletters]{ucs} |
---|
3 | \usepackage[utf8x]{inputenc} |
---|
4 | \usepackage{listings} |
---|
5 | \usepackage{../../style/cerco} |
---|
6 | \newcommand{\ocaml}{OCaml} |
---|
7 | \newcommand{\clight}{Clight} |
---|
8 | \newcommand{\matita}{Matita} |
---|
9 | \newcommand{\sdcc}{\texttt{sdcc}} |
---|
10 | |
---|
11 | \newcommand{\textSigma}{\ensuremath{\Sigma}} |
---|
12 | |
---|
13 | % LaTeX Companion, p 74 |
---|
14 | \newcommand{\todo}[1]{\marginpar{\raggedright - #1}} |
---|
15 | |
---|
16 | \lstdefinelanguage{coq} |
---|
17 | {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record}, |
---|
18 | morekeywords={[2]if,then,else}, |
---|
19 | } |
---|
20 | |
---|
21 | \lstdefinelanguage{matita} |
---|
22 | {keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on}, |
---|
23 | morekeywords={[2]whd,normalize,elim,cases,destruct}, |
---|
24 | mathescape=true, |
---|
25 | morecomment=[n]{(*}{*)}, |
---|
26 | } |
---|
27 | |
---|
28 | \lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false, |
---|
29 | keywordstyle=\color{red}\bfseries, |
---|
30 | keywordstyle=[2]\color{blue}, |
---|
31 | commentstyle=\color{green}, |
---|
32 | stringstyle=\color{blue}, |
---|
33 | showspaces=false,showstringspaces=false} |
---|
34 | |
---|
35 | \lstset{extendedchars=false} |
---|
36 | \lstset{inputencoding=utf8x} |
---|
37 | \DeclareUnicodeCharacter{8797}{:=} |
---|
38 | \DeclareUnicodeCharacter{10746}{++} |
---|
39 | \DeclareUnicodeCharacter{9001}{\ensuremath{\langle}} |
---|
40 | \DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}} |
---|
41 | |
---|
42 | |
---|
43 | \title{ |
---|
44 | INFORMATION AND COMMUNICATION TECHNOLOGIES\\ |
---|
45 | (ICT)\\ |
---|
46 | PROGRAMME\\ |
---|
47 | \vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}} |
---|
48 | |
---|
49 | \date{ } |
---|
50 | \author{} |
---|
51 | |
---|
52 | \begin{document} |
---|
53 | \thispagestyle{empty} |
---|
54 | |
---|
55 | \vspace*{-1cm} |
---|
56 | \begin{center} |
---|
57 | \includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png} |
---|
58 | \end{center} |
---|
59 | |
---|
60 | \begin{minipage}{\textwidth} |
---|
61 | \maketitle |
---|
62 | \end{minipage} |
---|
63 | |
---|
64 | |
---|
65 | \vspace*{0.5cm} |
---|
66 | \begin{center} |
---|
67 | \begin{LARGE} |
---|
68 | \bf |
---|
69 | Report n. D3.3\\ |
---|
70 | Executable Formal Semantics of front end intermediate languages\\ |
---|
71 | \end{LARGE} |
---|
72 | \end{center} |
---|
73 | |
---|
74 | \vspace*{2cm} |
---|
75 | \begin{center} |
---|
76 | \begin{large} |
---|
77 | Version 1.0 |
---|
78 | \end{large} |
---|
79 | \end{center} |
---|
80 | |
---|
81 | \vspace*{0.5cm} |
---|
82 | \begin{center} |
---|
83 | \begin{large} |
---|
84 | Main Authors:\\ |
---|
85 | Brian~Campbell |
---|
86 | \end{large} |
---|
87 | \end{center} |
---|
88 | |
---|
89 | \vspace*{\fill} |
---|
90 | \noindent |
---|
91 | Project Acronym: \cerco{}\\ |
---|
92 | Project full title: Certified Complexity\\ |
---|
93 | Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\ |
---|
94 | |
---|
95 | \clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881} |
---|
96 | |
---|
97 | \newpage |
---|
98 | |
---|
99 | \vspace*{7cm} |
---|
100 | \paragraph{Abstract} |
---|
101 | We report on the formalization of the front end intermediate languages for the |
---|
102 | \cerco{} project's compiler using executable semantics in the \matita{} proof |
---|
103 | assistant. |
---|
104 | |
---|
105 | \newpage |
---|
106 | |
---|
107 | \tableofcontents |
---|
108 | |
---|
109 | % TODO: clear up any -ize vs -ise |
---|
110 | % TODO: clear up any "front end" vs "front-end" |
---|
111 | % TODO: clear up any mentions of languages that aren't textsf'd. |
---|
112 | % TODO: mention the restricted form of types after Clight |
---|
113 | % TODO: fix unicode in listings |
---|
114 | |
---|
115 | \section{Introduction} |
---|
116 | |
---|
117 | This work formalizes the front end intermediate languages from the \cerco{} |
---|
118 | prototype compiler, described in previous deliverables~\cite{d2.1,d2.2}. The |
---|
119 | front end of the compiler is summarized in Figure~\ref{fig:summary}. We |
---|
120 | have also refined parts of the formal development that are used for several of |
---|
121 | the languages in the compiler. |
---|
122 | |
---|
123 | We will also report on work to add several invariants to the languages. This |
---|
124 | activity overlaps with task 3.4 on the correctness of the compiler front end. |
---|
125 | However, the use of dependent types mean that this work is tied closely to the |
---|
126 | definition of the languages and the transformations of the front end in task |
---|
127 | 3.2. By considering it now we can experiment with and judge its impact on the |
---|
128 | formal semantics, and how feasible retrofitting such invariants is. |
---|
129 | |
---|
130 | The input language to the formalized front end is the \textsf{Clight} |
---|
131 | language. The executable semantics for this language were presented in a |
---|
132 | previous deliverable~\cite{d3.1}. In the present work we will report on some |
---|
133 | minor changes to the semantics made to better align it with the whole |
---|
134 | development. |
---|
135 | |
---|
136 | \begin{figure} |
---|
137 | \begin{tabbing} |
---|
138 | \quad \= $\downarrow$ \quad \= \kill |
---|
139 | \textsf{C} (unformalized)\\ |
---|
140 | \> $\downarrow$ \> CIL parser (unformalized \ocaml)\\ |
---|
141 | \textsf{Clight}\\ |
---|
142 | \> $\downarrow$ \> cast removal\\ |
---|
143 | \> $\downarrow$ \> add runtime functions\\ |
---|
144 | \> $\downarrow$ \> labelling\\ |
---|
145 | \> $\downarrow$ \> stack variable allocation and control structure |
---|
146 | simplification\\ |
---|
147 | \textsf{Cminor}\\ |
---|
148 | \> $\downarrow$ \> generate global variable initialisation code\\ |
---|
149 | \> $\downarrow$ \> transform to RTL graph\\ |
---|
150 | \textsf{RTLabs}\\ |
---|
151 | \> $\downarrow$ \> start of target specific backend\\ |
---|
152 | \ \ \dots |
---|
153 | \end{tabbing} |
---|
154 | \caption{Front end languages and transformations} |
---|
155 | \label{fig:summary} |
---|
156 | \end{figure} |
---|
157 | |
---|
158 | \subsection{Revisions to the prototype compiler} |
---|
159 | |
---|
160 | Ongoing work to maintain and improve the prototype compiler resulted in |
---|
161 | several changes. Most importantly, the transformations to replace 16 and 32 |
---|
162 | bit types have been moved from the \textsf{Clight} language to the |
---|
163 | target-specific stage between \textsf{RTLabs} and \textsf{RTL} to help |
---|
164 | generate better code, and the addition of a \textsf{Clight} cast removal |
---|
165 | transformation to reduce the number of 16 and 32 bit operations. |
---|
166 | |
---|
167 | The formalized semantics have tracked these changes, and in this report we |
---|
168 | describe them as they currently stand. |
---|
169 | |
---|
170 | \section{Definitions common to several languages} |
---|
171 | |
---|
172 | The semantics for many of the languages in the compiler share some core parts: |
---|
173 | the memory model, environments, runtime values, definitions of operations and |
---|
174 | a monad for encapsulating failure and I/O (using resumptions). The executable |
---|
175 | memory model was ported from CompCert as part of the work on \textsf{Clight} |
---|
176 | and was reused for the front end languages\footnote{However, it is likely that |
---|
177 | we will revise the memory model to make it better suited for describing the |
---|
178 | back-end of our compiler.}. The failure and I/O monad was introduced in the |
---|
179 | previous deliverable on \textsf{Clight}~\cite[\S4.2]{d3.1}. The other parts |
---|
180 | are discussed below, with the only change to the runtime values being the |
---|
181 | representation of integers. |
---|
182 | |
---|
183 | \subsection{Identifiers} |
---|
184 | |
---|
185 | Each language features one or more kinds of names to represent variables, |
---|
186 | registers, \texttt{goto} labels and/or RTL graph labels. We also need to |
---|
187 | describe various maps keyed on identifiers in the semantics and during |
---|
188 | compilation. |
---|
189 | |
---|
190 | Previous work on the executable semantics of the target machine |
---|
191 | code included bit vectors and bit vector tries to define various integers in |
---|
192 | the semantics, and give a low level view of memory~\cite{d4.1}. To keep the |
---|
193 | size of the development down we have reused these data structures for |
---|
194 | identifiers and maps respectively. |
---|
195 | |
---|
196 | However, given the wide variety of identifiers used in the compiler we also |
---|
197 | wish to separate the different classes of identifier. Thus we encapsulate the |
---|
198 | bit vector representing the identifier in a datatype that also carries a tag |
---|
199 | identifying which class of identifier we are using: |
---|
200 | \begin{lstlisting} |
---|
201 | inductive identifier (tag:String) : Type[0] ≝ |
---|
202 | an_identifier : Word → identifier tag. |
---|
203 | \end{lstlisting} |
---|
204 | The tries are also tagged in the same manner. |
---|
205 | |
---|
206 | One difficultly with using fixed size bit vectors for identifiers is that |
---|
207 | fresh name generation can fail if generate too many. While we use an error |
---|
208 | monad to deal with failures, we wish to minimize its use in the compiler. |
---|
209 | Thus we add a flag to detect overflows, and check it after the phase of the |
---|
210 | compiler is complete to report exhaustion. The rest of the phase can then be |
---|
211 | written as if name generation always succeeds. |
---|
212 | |
---|
213 | \subsection{Machine integers and arithmetic} |
---|
214 | |
---|
215 | The bit vectors in~\cite{d4.1} also came equipped with some basic arithmetic |
---|
216 | for the target semantics. The front end required these operations to be |
---|
217 | generalized and extended. |
---|
218 | |
---|
219 | % TODO: expand, note need to sufficiently good performance for testing |
---|
220 | |
---|
221 | \subsection{Front end operations} |
---|
222 | |
---|
223 | The two front end intermediate languages, \textsf{Cminor} and \textsf{RTLabs}, |
---|
224 | share the same set of operations on values. They differ from |
---|
225 | \textsf{Clight}'s operations by incorporating casts and by having a separate |
---|
226 | operation for each type of data operated upon. For example, subtraction of |
---|
227 | pointers is treated as a different operation from subtraction of integers. |
---|
228 | |
---|
229 | A common semantics is given for these operations, as simple functions on the |
---|
230 | operation and runtime values. |
---|
231 | |
---|
232 | \section{Clight} |
---|
233 | |
---|
234 | The \textsf{Clight} input language remained largely the same as in the |
---|
235 | previous deliverable~\cite{d3.1}. The principal changes were to use the |
---|
236 | identifiers and arithmetic described above in place of the arbitrarily large |
---|
237 | integers used above. For the identifiers, this relieved us of the burden of |
---|
238 | adding an efficient datatype for maps by reusing the bit vector tries instead. |
---|
239 | |
---|
240 | The arithmetic replaced a dependent pair of an arbitrary integer and proof |
---|
241 | that it was in range of 32 bit integers by the exact bit vector for each |
---|
242 | size of integer. This direct approach is closer to the implementation and |
---|
243 | more obviously correct --- no extra precision can be left in by accident. |
---|
244 | |
---|
245 | \section{Cminor} |
---|
246 | |
---|
247 | The \textsf{Cminor} language does not store local variables in memory, and has |
---|
248 | simpler control structures than \textsf{Clight}. The syntax is similar to the |
---|
249 | prototype, except that the types attached to expressions are restricted so |
---|
250 | that some corner cases are ruled out in the \textsf{Cminor} to \textsf{RTLabs} |
---|
251 | stage (see the accompanying deliverable 3.2 for details): |
---|
252 | \begin{lstlisting} |
---|
253 | inductive expr : typ → Type[0] ≝ |
---|
254 | | Id : ∀t. ident → expr t |
---|
255 | | Cst : ∀t. constant → expr t |
---|
256 | | Op1 : ∀t,t'. unary_operation → expr t → expr t' |
---|
257 | | Op2 : ∀t1,t2,t'. binary_operation → expr t1 → expr t2 → expr t' |
---|
258 | | Mem : ∀t,r. memory_chunk → expr (ASTptr r) → expr t |
---|
259 | | Cond : ∀sz,sg,t. expr (ASTint sz sg) → expr t → expr t → expr t |
---|
260 | | Ecost : ∀t. costlabel → expr t → expr t. |
---|
261 | \end{lstlisting} |
---|
262 | In principle we could extend this to statically ensure that only well-typed |
---|
263 | \textsf{Cminor} expressions are considered, and we will consider this as part |
---|
264 | of the work on correctness in task 3.4. |
---|
265 | |
---|
266 | We also provide a variant of the syntax where the only initialization data is |
---|
267 | the size of each global variable, for use after the initialization code has |
---|
268 | been generated. |
---|
269 | |
---|
270 | The definition of the semantics is routine: a functional definition of a |
---|
271 | single small step of the machine is given, reusing the memory model, |
---|
272 | environments, arithmetic and operations mentioned above. |
---|
273 | |
---|
274 | \section{RTLabs} |
---|
275 | |
---|
276 | The \textsf{RTLabs} language provides a target independent Register Transfer |
---|
277 | Language, where programs are represented as control flow graphs. We use the |
---|
278 | identifiers described above for the graph labels and the maps for the graph |
---|
279 | itself. The tagging mechanism ensures that labels cannot be mixed up with |
---|
280 | other identifiers in the program (in particular, we cannot accidentally reuse |
---|
281 | a \texttt{goto} label from Cminor where a graph label should appear). |
---|
282 | |
---|
283 | Otherwise, the syntax and semantics of RTLabs mirrors that of the prototype |
---|
284 | compiler. The same common elements are used as for \textsf{Cminor}, including |
---|
285 | the front end operations mentioned above. |
---|
286 | |
---|
287 | \section{Testing} |
---|
288 | |
---|
289 | To provide some assurance that the semantics were properly implemented, and to |
---|
290 | support the testing described in the accompanying deliverable 3.2, we have |
---|
291 | adapted the pretty printers in the prototype compiler to produce \matita{} |
---|
292 | terms in the syntax for each language described above. |
---|
293 | |
---|
294 | A few common definitions were added for animating the small step semantics |
---|
295 | definitions for any of the front end languages in \matita. We then used a |
---|
296 | small selection of test cases to ensure basic functionality. However, this is |
---|
297 | still a time consuming process, so more testing will carried out once the |
---|
298 | extraction to \ocaml{} programs is implemented in \matita. |
---|
299 | |
---|
300 | \section{Embedding invariants} |
---|
301 | |
---|
302 | Each phase of the prototype compiler can fail in a number of places if the |
---|
303 | input language permits programs that are badly structured in some sense: a |
---|
304 | missing label in a \texttt{goto} statement or CFG, an undefined variable name, |
---|
305 | a \texttt{break} statement outside of a loop or \texttt{switch}, and so on. |
---|
306 | We wish to restrict our intermediate languages using dependent types to remove |
---|
307 | as many `junk' programs as possible. We also hope that such restrictions will |
---|
308 | help in other correctness proofs. |
---|
309 | |
---|
310 | This goal lies in the overlap between several tasks in the project: it |
---|
311 | involves manipulating the syntax and semantics of the intermediate languages |
---|
312 | (the present work), the encoding of the front-end compiler phases in \matita{} |
---|
313 | (task 3.2) and the correctness of the front-end (task 3.4). Thus this work is |
---|
314 | rather experimental; it is being carried out on branches in our source code |
---|
315 | repository and the final form will be decided and merged in during task 3.4. |
---|
316 | |
---|
317 | So far we have tried adding two forms of invariant --- one by using dependent |
---|
318 | types to index statements in \textsf{Cminor} by their block depth, and the |
---|
319 | other asserts that variables and labels are present in the appropriate |
---|
320 | environments by adding a separate invariant to each function. |
---|
321 | |
---|
322 | \subsection{Cminor block depth} |
---|
323 | |
---|
324 | The \textsf{Cminor} language simplifies loops with a single infinite loop |
---|
325 | construct and the switch statement does not contain the individual cases. |
---|
326 | Instead, statements are provided for infinite loops, non-looping blocks and |
---|
327 | exiting an arbitrary number of blocks (which is also what the switch statement |
---|
328 | does\footnote{We are considering replacing the \textsf{Cminor} switch |
---|
329 | statement with one that uses \texttt{goto}-like labels in both the prototype |
---|
330 | and the formalized compilers, but for now we stick with this CompCert-like |
---|
331 | arrangement.}. |
---|
332 | |
---|
333 | However, this means that there are badly-formed \textsf{Cminor} programs such |
---|
334 | as |
---|
335 | \begin{lstlisting}[language={}] |
---|
336 | int main() { |
---|
337 | block { |
---|
338 | exit 5 |
---|
339 | } |
---|
340 | } |
---|
341 | \end{lstlisting} |
---|
342 | where we attempt to exit more blocks than exist. To rule these out (including |
---|
343 | demonstrating that the previous phase of the compiler does not generate them) |
---|
344 | we can index the statements of the language by the depth of the enclosing |
---|
345 | blocks. |
---|
346 | |
---|
347 | The adaption of the syntax adds the depth to every statement, and uses bounded |
---|
348 | integers in the exit and switch statements: |
---|
349 | \begin{lstlisting}[language=matita] |
---|
350 | inductive stmt : ∀blockdepth:nat. Type[0] ≝ |
---|
351 | | St_skip : ∀n. stmt n |
---|
352 | ... |
---|
353 | | St_loop : ∀n. stmt n → stmt n |
---|
354 | | St_block : ∀n. stmt (S n) → stmt n |
---|
355 | | St_exit : ∀n. Fin n → stmt n |
---|
356 | (* expr to switch on, table of <switch value, #blocks to exit>, default *) |
---|
357 | | St_switch : expr → ∀n. list (int × (Fin n)) → Fin n → stmt n |
---|
358 | ... |
---|
359 | \end{lstlisting} |
---|
360 | where \texttt{stmt n} is a statement enclosed in \texttt{n} blocks or loops, |
---|
361 | and \texttt{Fin n} is a standard construction for a natural number which is at |
---|
362 | most \texttt{n}. |
---|
363 | |
---|
364 | In the semantics the number of blocks is also added to the continuations and |
---|
365 | state, and the function to find the continuation from an exit statement can be |
---|
366 | made failure-free. |
---|
367 | |
---|
368 | We note in passing that adding this parameter detected a small mistake in the |
---|
369 | semantics concerning continuations and tail calls, although the mistake itself |
---|
370 | was benign. |
---|
371 | |
---|
372 | \subsection{Identifier invariants} |
---|
373 | |
---|
374 | For showing that the variables and labels occurring in the body of a function |
---|
375 | are present in the relevant structures are add an additional invariant to the |
---|
376 | function records. |
---|
377 | |
---|
378 | For \textsf{Cminor} we add a higher-order predicate which recursively applies |
---|
379 | a predicate to all substatements: |
---|
380 | \begin{lstlisting}[language=matita] |
---|
381 | let rec stmt_P (P:stmt → Prop) (s:stmt) on s : Prop ≝ |
---|
382 | match s with |
---|
383 | [ St_seq s1 s2 ⇒ P s ∧ stmt_P P s1 ∧ stmt_P P s2 |
---|
384 | | St_ifthenelse _ _ _ s1 s2 ⇒ P s ∧ stmt_P P s1 ∧ stmt_P P s2 |
---|
385 | | St_loop s' ⇒ P s ∧ stmt_P P s' |
---|
386 | | St_block s' ⇒ P s ∧ stmt_P P s' |
---|
387 | | St_label _ s' ⇒ P s ∧ stmt_P P s' |
---|
388 | | St_cost _ s' ⇒ P s ∧ stmt_P P s' |
---|
389 | | _ ⇒ P s |
---|
390 | ]. |
---|
391 | \end{lstlisting} |
---|
392 | Dependent pattern matching on statements thus allows \lstinline'stmt_P' to |
---|
393 | unfold to the predicate on the current statement as well as the predicate |
---|
394 | applied to all substatements. |
---|
395 | |
---|
396 | We use two properties in \textsf{Cminor}: |
---|
397 | \begin{enumerate} |
---|
398 | \item All variables are present in the list of parameters or the list of |
---|
399 | variables for the function (this also uses a similar recursive predicate on |
---|
400 | expressions). |
---|
401 | \item All labels in \texttt{goto} statements appear in a label statement. |
---|
402 | \end{enumerate} |
---|
403 | The function definition thus becomes: |
---|
404 | \begin{lstlisting}[language=matita] |
---|
405 | record internal_function : Type[0] ≝ |
---|
406 | { f_return : option typ |
---|
407 | ; f_params : list (ident × typ) |
---|
408 | ; f_vars : list (ident × typ) |
---|
409 | ; f_stacksize : nat |
---|
410 | ; f_body : stmt |
---|
411 | ; f_inv : stmt_P (λs.stmt_vars (λi.Exists ? (λx.\fst x = i) (f_params @ f_vars)) s ∧ |
---|
412 | stmt_labels (λl.Exists ? (λl'.l' = l) (labels_of f_body)) s) f_body |
---|
413 | }. |
---|
414 | \end{lstlisting} |
---|
415 | where \lstinline'stmt_vars' and \lstinline'stmt_labels' return the |
---|
416 | variables and labels that appear directly in a statement (but not |
---|
417 | substatements), and |
---|
418 | \lstinline'labels_of' returns a list of all the labels defined in a |
---|
419 | statement. |
---|
420 | |
---|
421 | The \textsf{Clight} semantics can be amended to use these invariants, although |
---|
422 | the main benefit is for the compiler stages (see the accompanying deliverable |
---|
423 | 3.2 for details). The semantics require the invariants to be added to the |
---|
424 | state and continuations. It was convenient to split the continuations between |
---|
425 | the local continuation representing the rest of the code to be executed within |
---|
426 | the function, and the stack of function calls. The invariant for variables is |
---|
427 | slightly different --- we require that every variable appear in the local |
---|
428 | environment. We use \lstinline'f_inv' from the function to establish this |
---|
429 | invariant when the environment is set up on function entry. |
---|
430 | |
---|
431 | It is unclear whether changing the semantics is really worthwhile. It |
---|
432 | witnesses that the invariants are those we wanted, but makes no difference to |
---|
433 | the actual execution of the program, especially as the execution can still |
---|
434 | fail due to runtime errors. Moreover, it is unclear what effect the presence |
---|
435 | of proof terms and more dependent pattern matching in the semantics will have |
---|
436 | on the complexity of future correctness proofs. We plan to examine this issue |
---|
437 | during task 3.4. |
---|
438 | |
---|
439 | We use a similar method to specify the invariant that the \textsf{RTLabs} |
---|
440 | graph is closed --- that is, any successor labels in a statement in the graph |
---|
441 | are present in the graph. The definition is simpler in \textsf{RTLabs} |
---|
442 | because the flat representation of the graph does not require recursive |
---|
443 | definitions like \lstinline'stmt_P' above. |
---|
444 | |
---|
445 | \section{Conclusion} |
---|
446 | |
---|
447 | \bibliographystyle{plain} |
---|
448 | \bibliography{report} |
---|
449 | |
---|
450 | \end{document} |
---|