# source:Deliverables/D3.1/Report/report.tex@207

Last change on this file since 207 was 207, checked in by campbell, 9 years ago

Add memory extensions and rework parts of D3.1.

File size: 23.1 KB
Line
1\documentclass[11pt,epsf,a4wide]{article}
2\usepackage[utf8x]{inputenc}
3\usepackage{listings}
4\usepackage{../../style/cerco}
5\newcommand{\ocaml}{OCaml}
6\newcommand{\clight}{Clight}
7\newcommand{\matita}{matita}
8\newcommand{\sdcc}{\texttt{sdcc}}
9
10\newcommand{\textSigma}{\ensuremath{\Sigma}}
11
12% LaTeX Companion, p 74
13\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
14
15\lstdefinelanguage{coq}
16  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
17   morekeywords={[2]if,then,else},
18  }
19
20\lstdefinelanguage{matita}
21  {keywords={ndefinition,nlemma,ntheorem,nremark,ninductive,nrecord,nqed,nlet,rec,match,with,Type},
22   morekeywords={[2]nwhd,nnormalize,nelim,ncases,ndestruct},
23   mathescape=true,
24  }
25
26\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
27        keywordstyle=\color{red}\bfseries,
28        keywordstyle=[2]\color{blue},
30        stringstyle=\color{blue},
31        showspaces=false,showstringspaces=false}
32
33\lstset{extendedchars=false}
34\lstset{inputencoding=utf8x}
35\DeclareUnicodeCharacter{8797}{:=}
36\DeclareUnicodeCharacter{10746}{++}
37\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
38\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
39
40
41\title{
42INFORMATION AND COMMUNICATION TECHNOLOGIES\\
43(ICT)\\
44PROGRAMME\\
45\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
46
47\date{ }
48\author{}
49
50\begin{document}
51\thispagestyle{empty}
52
53\vspace*{-1cm}
54\begin{center}
55\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
56\end{center}
57
58\begin{minipage}{\textwidth}
59\maketitle
60\end{minipage}
61
62
63\vspace*{0.5cm}
64\begin{center}
65\begin{LARGE}
66\bf
67Report n. D3.1\\
68Executable Formal Semantics of C\\
69\end{LARGE}
70\end{center}
71
72\vspace*{2cm}
73\begin{center}
74\begin{large}
75Version 1.0
76\end{large}
77\end{center}
78
79\vspace*{0.5cm}
80\begin{center}
81\begin{large}
82Main Authors:\\
83Brian~Campbell, Randy~Pollack
84\end{large}
85\end{center}
86
87\vspace*{\fill}
88\noindent
89Project Acronym: \cerco{}\\
90Project full title: Certified Complexity\\
91Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
92
94
95\tableofcontents
96
97\section{Introduction}
98
99We present an executable formal semantics of the C programming language which
100will serve as the specification of the input language for the \cerco{}
101verified compiler.  Our semantics is based on Leroy et.~al.'s C semantics for
102the CompCert project~\cite{1111042,compcert-mm08}, which divides the treatment
103of C into two pieces.  The first is an \ocaml{} stage which parses and
104elaborates C into an abstract syntax tree for the simpler \clight{} language,
105based on the CIL C parser.  The second part is a small step semantics for
106\clight{} formalised in the proof tool, which we have ported from Coq to the
107\matita{} theorem prover.
108This semantics is given in the form of inductive definitions, and so we have
109added a third part giving an equivalent functional presentation in \matita{}.
110
111The \cerco{} compiler needs to deal with the constrained memory model of
112the target microcontroller (in our case, the 8051).  Thus each part of the
113semantics has been extended to allow explicit handling of the
114microcontroller's memory spaces.  \emph{Cost labels} have also been added to
115the \clight{} semantics to support the labelling approach to cost annotations
116presented in~\cite{d2.1}.
117
118The following section discusses the C language extensions for memory spaces.
119Then the port of the two stages of the CompCert \clight{} semantics is
120described in Section~\ref{sec:port}, followed by the new executable semantics
121in Section~\ref{sec:exec}.  Finally we discuss how the semantics can be tested
122in Section~\ref{sec:valid}.
123
124\section{Language extensions for the 8051 memory model}
125\label{sec:ext}
126
127The choice of an extended 8051 target for the \cerco{} compiler imposes an
128irregular memory model with tight resource constraints.  The different memory
129spaces and access modes are summarised in Figure~\ref{fig:memorymodel} ---
130essentially the development of the 8051 family has fragmented memory into
131four regions:  one half of the internal' memory is fully accessible but also
132contains the register banks, the second half cannot be directly addressed
133because it is shadowed by the Special Function Registers' (SFRs) for I/O;
134external memory' provides the bulk of memory in a separate address space; and
135the code is in its own read-only space.
136
137To make efficient use of the limited amount of memory, compilers for 8051
138microcontrollers provide extra keywords to allocate global variables to
139particular memory spaces, and to limit pointers to address a particular space.
140The freely available \sdcc{} compiler provides the following extensions for
141the 8051 memory spaces:
142\begin{table}[ht]
143\begin{tabular}{rcl}
144Attribute & Pointer size & Memory space \\\hline
145\lstinline'__data' & 1 & Internal, first half (0h -- 7fh) \\
146\lstinline'__idata' & 1 & Internal, indirect only (80h -- ffh) \\
147\lstinline'__pdata' & 1 & External, page access (usually 0h -- 7fh) \\
148\lstinline'__xdata' & 2 & External, any (0h -- ffffh) \\
149\lstinline'__code' & 2 & Code, any (0h -- ffffh) \\
150\emph{none}& 3 & Any / Generic pointer
151\end{tabular}
152\end{table}
153The generic pointers are a tagged union of the other kinds of pointers.
154
155We intend the \cerco{} compiler to support extensions that are broadly
156compatible with \sdcc{} to enable the compilation of programs with either.  In
157particular, this would allow the comparison of the behaviour of test cases
158compiled with both compilers.  Thus the C syntax and semantics have been
159extended with the memory space attributes listed above.  The syntax follows
160\sdcc{} and in the semantics we track the memory space that each block was
161allocated in and only permit access via the appropriate kinds of pointers.
162The details on these changes are given in the following sections.
163
164The \sdcc{} compiler also supports special variable types for accessing the
165SFRs, which provide the standard I/O mechanism for the 8051 family.  (Note
166that pointers to these types are not allowed because only direct addressing of
167the SFRs is allowed.)  We intend to use CompCert-style external functions'
168instead of special types.  These are functions which are declared, but no C
169implementation is provided.  Instead they are provided by the runtime or
170compiled directly to the corresponding machine code.  This has the advantage
171that no changes from the CompCert semantics are required, and a compatibility
172library can be provided for \sdcc{} if necessary.  The 8051 and the \sdcc{}
173compiler also provide bit-level access to a small region of internal memory.
174We do not intend to expose this feature to C programs in the \cerco{}
175compiler, and so no extension is provided for it.
176
177\section{Port of CompCert \clight{} semantics to \matita{}}
178\label{sec:port}
179
180The first stage taken from the CompCert semantics is the parsing and
181elaboration of C programs into the simpler \clight{} language.  This is based
182upon the CIL library for parsing, analysing and transforming C programs by
183Necula et.~al.~\cite{cil02}.  The elaboration performed provides explicit type
184information throughout the program, including extra casts for promotion, and
185performs simplifications such as breaking up expressions with side effects
186into statements that perform the effects using effect-free expressions.  The
187transformed \clight{} programs are much more manageable and lack the ambiguities
188of C, but also remain easily understood by C programmers.
189
190The parser has been extended with the 8051 memory spaces attributes given
191above.  The resulting abstract syntax tree records them on global variable
192declarations and pointer types.  However, we also need to deal with them
193during the elaboration process to produce all of the required information.
194For example, when the address-of operator \lstinline'&' is used it must decide
195which kind of pointer should be used.  Thus the extended elaboration process
196keeps track of the memory space (if any) that the value of each
197expression resides in.  Where the memory space is not known, a generic pointer
198will be used instead.  Moreover, we also include the pointer kind when
199determining whether a cast must be inserted so that conversions between
200pointer representations can be performed.
201
202
203Thus the elaboration turns the C code
204\begin{quote}
205\begin{lstlisting}[language=C]
206int g(int *x) { return 5; }
207
208int f(__data int *x, int *y) {
209  return x==y ? g(x) : *x;
210}
211
212__data int i = 1;
213
214int main(void) {
215  return f(&i, &i);
216}
217\end{lstlisting}
218\end{quote}
219into the following Clight program:
220\begin{quote}
221\begin{lstlisting}[language=C]
222int g(int *x) { return 5; }
223
224int f(__data int * x, int * y)
225{
226  int t;
227  if (x == (__data int * )y) {
228    t = g((int * )x);
229  } else {
230    t = *x;
231  }
232  return t;
233}
234
235int main(void)
236{
237  int t;
238  t = f(&i, (int * )(&i));
239  return t;
240}
241\end{lstlisting}
242\end{quote}
243The expressions in \lstinline'f' and \lstinline'main' had to be broken up due
244to side-effects, and casts have been added to change between generic pointers
245and pointers specific to the \lstinline'__data' section of memory.  The
246underlying data structure also has types attached to every expression, but
247these are inconvenient to show in source form.
248
249Note that the translation from C to \clight{} is not proven correct ---
250instead it effectively forms a semi-formal part of the whole C semantics.  We
251can have some confidence in the code, however, because it has received testing
252in the \cerco{} prototype, and it is very close to the version used in
253CompCert.  We can also perform testing of the semantics without involving the
254rest of the compiler because we have an executable semantics.  Moreover, the
255cautious programmer could choose to inspect the \clight{} code, or even work
256entirely in the \clight{} language.
257
258The semantics for \clight{} itself has been ported from the Coq development
259used in CompCert to \matita{} for use in \cerco{}.  Details about the original
260big-step formalisation of \clight{} can be found in Leroy and
261Blazy~\cite{compcert-mm08} (including a discussion of the translation from C
262in \S 4.1), although we started from a later version with a
263small-step semantics and hence support for \lstinline'goto' statements.
264Several parts of the semantics were shared with other parts of the CompCert
265development, notably:
266\begin{itemize}
267\item representation of primitive values (integers, pointers and undefined
268values, but not structures or unions) and operations on them,
269\item traces of I/O events,
270\item a memory model that keeps conceptually distinct sections of memory
271strictly separate (assigning undefined behaviour' to a buffer overflow, for
272instance),
273\item results about composing execution steps of arbitrary small-step
274semantics,
275\item data structures for local and global environments, and
276\item common error handling constructs, in particular an error monad.
277\end{itemize}
278We anticipate a similar arrangement for the \cerco{} verified compiler,
279although this means that there may be further changes to the common parts of the
280semantics later in the project to harmonise the stages of the compiler.
281
282The main body of the small-step semantics is a number of inductive definitions
283giving details of the defined behaviour for casts, expressions and statements.
284Expressions are side-effect free in \clight{} and only produce a value as
285output.  In our case we also need a trace of any cost labels that are
286evaluated' so that we will be able to give fine-grained costs for
287the execution of compiled conditional expressions.
288
289As an example of one of the expression rules, consider an expression which
290evaluates a variable, \lstinline'Expr (Evar id) ty'.  A variable is an example
291of a class of expressions called \emph{lvalues}, which are roughly those
292expressions which can be assigned to.  Thus we use a general rule for lvalues,
293\label{page:evalvar}
294\begin{quote}
295\begin{lstlisting}
296ninductive eval_expr (ge:genv) (e:env) (m:mem) : expr $\rightarrow$ val $\rightarrow$ trace $\rightarrow$ Prop ≝
297...
298  | eval_Elvalue: $\forall$a,ty,psp,loc,ofs,v,tr.
299      eval_lvalue ge e m (Expr a ty) psp loc ofs tr $\rightarrow$
300      load_value_of_type ty m psp loc ofs = Some ? v $\rightarrow$
301      eval_expr ge e m (Expr a ty) v tr
302\end{lstlisting}
303\end{quote}
304where the auxiliary relation \lstinline'eval_lvalue' yields the location of
305the value, \lstinline'psp,loc,ofs', consisting of memory space, memory block,
306and offset into the block, respectively.  The expression can thus evaluate to
307the value \lstinline'v' if \lstinline'v' can be loaded from that location.
308One corresponding part of the \lstinline'eval_lvalue' definition is
309\begin{quote}
310\begin{lstlisting}
311...
312with eval_lvalue (*(ge:genv) (e:env) (m:mem)*) :
313        expr $\rightarrow$ memory_space $\rightarrow$ block $\rightarrow$ int $\rightarrow$ trace $\rightarrow$ Prop ≝
314  | eval_Evar_local:   $\forall$id,l,ty.
315      get ??? id e = Some ? l $\rightarrow$
316      eval_lvalue ge e m (Expr (Evar id) ty) Any l zero E0
317...
318\end{lstlisting}
319\end{quote}
320simply looks up the variable in the local environment.  A similar rule handles
321global variables, with an extra check to ensure that no local variable has the
322same name.  Note that the two relations are defined using mutual recursion
323because \lstinline'eval_lvalue' uses \lstinline'eval_expr' for the evaluation
324of the pointer expression in the dereferencing rule.
325
326Casts also have an auxiliary relation to specify the allowed changes, and
327operations on values (including the changes in representation performed by
328casting) are given as functions.
329
330The only new expression in our semantics is the cost label which wraps around
331another expression.  It does not change the result, but merely adds the given
332label to the trace to identify the branches taken in conditional
333expressions so that accurate cost information can be attached to the program:
334\begin{quote}
335\begin{lstlisting}
336  | eval_Ecost: $\forall$a,ty,v,l,tr.
337      eval_expr ge e m a v tr $\rightarrow$
338      eval_expr ge e m (Expr (Ecost l a) ty) v (tr++Echarge l)
339\end{lstlisting}
340\end{quote}
341\todo{make the semantics of the cost labels clearer}
342
343As the expressions are side-effect free, all of the changes to the state are
344performed by statements.  The state itself is represented by records of the
345form
346\begin{quote}
347\begin{lstlisting}
348ninductive state: Type :=
349  | State:
350      $\forall$f: function.
351      $\forall$s: statement.
352      $\forall$k: cont.
353      $\forall$e: env.
354      $\forall$m: mem.  state
355  | Callstate:
356      $\forall$fd: fundef.
357      $\forall$args: list val.
358      $\forall$k: cont.
359      $\forall$m: mem. state
360  | Returnstate:
361      $\forall$res: val.
362      $\forall$k: cont.
363      $\forall$m: mem. state.
364\end{lstlisting}
365\end{quote}
366During normal execution the state contains the currently executing function's
367definition (used to find \lstinline'goto' labels and also to check whether the
368function is expected to return a value), the statement to be executed next, a
369continuation value to be executed afterwards (where successor statements and
370details of function calls and loops are stored), the local environment mapping
371variables to memory locations and the current memory state.
372\todo{need to note that all variables 'appear' to be memory allocated, even
373       if they're subsequently optimised away.}
374The function call and return states appear to store less information because
375the details of the caller are contained in the continuation.
376
377An example of the statement execution rules is the assignment rule
378(corresponding to the C syntax \lstinline'a1 = a2'),
379\label{page:Sassign}
380\begin{quote}
381\begin{lstlisting}
382ninductive step (ge:genv) : state $\rightarrow$ trace $\rightarrow$ state $\rightarrow$ Prop ≝
383  | step_assign:   $\forall$f,a1,a2,k,e,m,psp,loc,ofs,v2,m',tr1,tr2.
384      eval_lvalue ge e m a1 psp loc ofs tr1 $\rightarrow$
385      eval_expr ge e m a2 v2 tr2 $\rightarrow$
386      store_value_of_type (typeof a1) m psp loc ofs v2 = Some ? m' $\rightarrow$
387      step ge (State f (Sassign a1 a2) k e m)
388           (tr1++tr2) (State f Sskip k e m')
389...
390\end{lstlisting}
391\end{quote}
393\begin{itemize}
394\item if \lstinline'a1' can evaluate to the location \lstinline'psp,loc,ofs',
395\item \lstinline'a2' evaluates to a value \lstinline'v2', and
396\item storing \lstinline'v2' at location \lstinline'psp,loc,ofs' succeeds,
397yielding the new memory state \lstinline-m'-, then
398\item the program can step from the state about to execute
399\lstinline'Sassign a1 a2' to a state with the updated memory \lstinline-m'-
400about to execute the no-op \lstinline'Sskip'.
401\end{itemize}
402This rule would be followed by one of the rules to which replaces the
403\lstinline'Sskip' statement with the real' next statement constructed from
404the continuation \lstinline'k'.  Note that the only true side-effect here is
405the change in memory --- the local environment is initialised once and for all
406on function entry, and the only events appearing in the trace are cost labels
407used purely for accounting.
408\todo{ordering imposed by cost labels?}
409
410The \clight{} language provides input and output effects through external'
411functions and the step rule
412\begin{quote}
413\begin{lstlisting}
414  | step_external_function: $\forall$id,targs,tres,vargs,k,m,vres,t.
415      event_match (external_function id targs tres) vargs t vres $\rightarrow$
416      step ge (Callstate (External id targs tres) vargs k m)
417            t (Returnstate vres k m)
418\end{lstlisting}
419\end{quote}
420which allows the function to be invoked with and return any values subject to
421the enforcement of the typing rules in \lstinline'event_match', which also
422provides the trace.
423
424Cost label statements add the given label to the trace analogously to the cost
425label expressions above.
426
427\section{Executable semantics}
428\label{sec:exec}
429
430\todo{bit of a cheat to claim that it's equivalent without a completeness
431proof.}
432We have added an equivalent functional version of the \clight{} semantics that
433can be used to animate programs.  The definitions roughly follow the inductive
434semantics, but are necessarily rearranged around pattern matching of the
435relevant parts of the state rather than presenting each case separately.
436
437The code corresponding to the variable lookup definitions on
438page~\pageref{page:evalvar} is
439\begin{quote}
440\begin{lstlisting}
441nlet rec exec_expr (ge:genv) (en:env) (m:mem) (e:expr) on e :
442               res ($\Sigma$r:val$\times$trace. eval_expr ge en m e (\fst r) (\snd r)) ≝
443match e with
444[ Expr e' ty $\Rightarrow$
445  match e' with
446  [ ...
447  | Evar _ $\Rightarrow$ Some ? (
448      do $\langle$l,tr$\rangle$ $\leftarrow$ exec_lvalue' ge en m e' ty;
449      do v $\leftarrow$ opt_to_res ? (load_value_of_type' ty m l);
450      OK ? $\langle$v,tr$\rangle$)
451...
452  ]
453]
454and exec_lvalue' (ge:genv) (en:env) (m:mem) (e':expr_descr) (ty:type) on e' :
455    res ($\Sigma$r:memory_space $\times$ block $\times$ int $\times$ trace. eval_lvalue ge en m (Expr e' ty) (\fst (\fst (\fst r))) (\snd (\fst (\fst r))) (\snd (\fst r)) (\snd r)) ≝
456  match e' with
457  [ Evar id $\Rightarrow$
458      match (get … id en) with
459      [ None $\Rightarrow$ Some ? (do $\langle$sp,l$\rangle$ $\leftarrow$ opt_to_res ? (find_symbol ? ? ge id); OK ? $\langle$$\langle$$\langle$sp,l$\rangle$,zero$\rangle$,E0$\rangle$) (* global *)
460      | Some loc $\Rightarrow$ Some ? (OK ? $\langle$$\langle$$\langle$Any,loc$\rangle$,zero$\rangle$,E0$\rangle$) (* local *)
461      ]
462...
463\end{lstlisting}
464\end{quote}
465where the result is placed in an error monad (the \lstinline'res' type
466constructor) so that \emph{undefined behaviour} such as dereferencing an
467invalid pointer can be rejected.  We use \lstinline'do' notation similar to
469\begin{quote}
470\begin{lstlisting}
471do x $\leftarrow$ e; e'
472\end{lstlisting}
473\end{quote}
474means evaluate \lstinline'e' and if it yields a value then bind that to
475\lstinline'x' and evaluate \lstinline-e'-, and otherwise propogate the error.
476
477Note the $\Sigma$ type for the result of the function, which shows that
478successful executions are sound with respect to the inductive semantics.
479Matita automatically generates proof obligations for each case due to a
480coercion between the types
481\begin{center}
482\lstinline'option (res T)' \quad and \quad \lstinline'res ($\Sigma$x:T. P x)'
483\end{center}
484(where a branch marked \lstinline'None' would generate a proof obligation to
485show that it is impossible, although the semantics do not use this feature).
486This is intended to mimic Sozeau's \textsc{Russell} language and elaboration
487into Coq~\cite{sozeau06}.  The coercion also triggers an automatic mechanism
488in \matita{} to add equalities for each pattern matched.  Each case of the
489soundness proof consists of extracting an equality for each computation in the
491% "in the monad" is a bit vague here
492making any further case distinctions necessary and applying the corresponding
493rule from the inductive semantics.
494
495Evaluating a step of a statement is complicated by the presence of the
496`external' functions for I/O, which can return arbitrary values.  These are
497handled by a resumption monad, which on encountering some I/O returns a
498suspension. When the suspension is applied to a value the evaluation of the
499semantics is resumed.  Resumption monads are a standard tool for providing
500denotational semantics for input~\cite{Moggi199155} and interleaved
501concurrency~\cite{??}.
502The definition also incorporates errors, and uses a coercion to automatically
503transform values from the plain error monad.
504\todo{should perhaps give more details of the resumption monad?}
505
506The execution of assignments is straightforward,
507\begin{quote}
508\begin{lstlisting}
509nlet rec exec_step (ge:genv) (st:state) on st : (IO eventval io_out ($\Sigma$r:trace $\times$ state. step ge st (\fst r) (\snd r))) ≝
510match st with
511[ State f s k e m $\Rightarrow$
512  match s with
513  [ Sassign a1 a2 $\Rightarrow$ Some ? (
514    ! $\langle$l,tr1$\rangle$ $\leftarrow$ exec_lvalue ge e m a1;:
515    ! $\langle$v2,tr2$\rangle$ $\leftarrow$ exec_expr ge e m a2;:
516    ! m' $\leftarrow$ store_value_of_type' (typeof a1) m l v2;:
517    ret ? $\langle$tr1++tr2, State f Sskip k e m'$\rangle$)
518...
519\end{lstlisting}
520\end{quote}
521where \lstinline'!' is used in place of \lstinline'do' due to the change in
522monad.  The content is essentially the same as the inductive rule given on
523page~\pageref{page:Sassign}.
524
525The handling of external calls uses the
526\begin{quote}
527\begin{lstlisting}
528do_io : ident $\rightarrow$ list eventval $\rightarrow$ IO eventval io_out eventval
529\end{lstlisting}
530\end{quote}
531function to suspend execution:
532\begin{quote}
533\begin{lstlisting}
534...
535  ]
536| Callstate f0 vargs k m $\Rightarrow$
537  match f0 with
538  [ ...
539  | External f argtys retty $\Rightarrow$ Some ? (
540      ! evargs $\leftarrow$ check_eventval_list vargs (typlist_of_typelist argtys);:
541      ! evres $\leftarrow$ do_io f evargs;:
542      ! vres $\leftarrow$ check_eventval evres (proj_sig_res (signature_of_type argtys rett
543      ret ? $\langle$(Eextcall f evargs evres), Returnstate vres k m$\rangle$)
544...
545\end{lstlisting}
546\end{quote}
547\todo{say something more about the rest?}
548
549Together with functions to provide the initial state for a program and to
550detect a final state we can write a function to run the program up to a given
551number of steps.  Similarly, a corecursive function can return the entire
552execution as a stream of trace and state pairs.
553
554\todo{completeness?}
555
556\section{Validation}
557\label{sec:valid}
558
559\begin{verbatim}
560Animation of simple C programs.
561\end{verbatim}
562
563\begin{verbatim}
564MORE:
565  library stuff: choice of integers, maps
566  check for any axioms
567\end{verbatim}
568
569\bibliographystyle{plain}
570\bibliography{report}
571
572\end{document}
Note: See TracBrowser for help on using the repository browser.