\documentclass[11pt,epsf,a4wide]{article}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{../../style/cerco}
\newcommand{\ocaml}{OCaml}
\newcommand{\clight}{Clight}
\newcommand{\matita}{Matita}
\newcommand{\sdcc}{\texttt{sdcc}}
\newcommand{\textSigma}{\ensuremath{\Sigma}}
% LaTeX Companion, p 74
\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
\lstdefinelanguage{coq}
{keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
morekeywords={[2]if,then,else},
}
\lstdefinelanguage{matita}
{keywords={ndefinition,nlemma,ntheorem,nremark,ninductive,nrecord,nqed,nlet,rec,match,with,Type,and,on},
morekeywords={[2]nwhd,nnormalize,nelim,ncases,ndestruct},
mathescape=true,
}
\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
keywordstyle=\color{red}\bfseries,
keywordstyle=[2]\color{blue},
commentstyle=\color{green},
stringstyle=\color{blue},
showspaces=false,showstringspaces=false}
\lstset{extendedchars=false}
\lstset{inputencoding=utf8x}
\DeclareUnicodeCharacter{8797}{:=}
\DeclareUnicodeCharacter{10746}{++}
\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
\title{
INFORMATION AND COMMUNICATION TECHNOLOGIES\\
(ICT)\\
PROGRAMME\\
\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
\date{ }
\author{}
\begin{document}
\thispagestyle{empty}
\vspace*{-1cm}
\begin{center}
\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
\end{center}
\begin{minipage}{\textwidth}
\maketitle
\end{minipage}
\vspace*{0.5cm}
\begin{center}
\begin{LARGE}
\bf
Report n. D3.1\\
Executable Formal Semantics of C\\
\end{LARGE}
\end{center}
\vspace*{2cm}
\begin{center}
\begin{large}
Version 1.0
\end{large}
\end{center}
\vspace*{0.5cm}
\begin{center}
\begin{large}
Main Authors:\\
Brian~Campbell, Randy~Pollack
\end{large}
\end{center}
\vspace*{\fill}
\noindent
Project Acronym: \cerco{}\\
Project full title: Certified Complexity\\
Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
\newpage
\vspace*{7cm}
\paragraph{Abstract}
We present an execution semantics of the C programming language for
use in the \cerco{} project. It is based on the small-step inductive
semantics used by the CompCert verified compiler. We discuss the
extensions required for our target architecture, porting the semantics
to our choice of tool, \matita{}, providing an equivalent executable
semantics and the validation of the semantics.
\newpage
\tableofcontents
\section{Introduction}
We present an executable formal semantics of the C programming language which
will serve as the specification of the input language for the \cerco{}
verified compiler. Our semantics is based on Leroy et.~al.'s C semantics for
the CompCert project~\cite{1111042,compcert-mm08}, which divides the treatment
of C into two pieces. The first is an \ocaml{} stage which parses and
elaborates C into an abstract syntax tree for the simpler \clight{} language,
based on the CIL C parser. The second part is a small step semantics for
\clight{} formalised in the proof tool, which we have ported from Coq to the
\matita{} theorem prover.
This semantics is given in the form of inductive definitions, and so we have
added a third part giving an equivalent functional presentation in \matita{}.
The \cerco{} compiler needs to deal with the constrained memory model of
the target microcontroller (in our case, the 8051). Thus each part of the
semantics has been extended to allow explicit handling of the
microcontroller's memory spaces. \emph{Cost labels} have also been added to
the \clight{} semantics to support the labelling approach to cost annotations
presented in a previous deliverable~\cite{d2.1}.
The following section discusses the C language extensions for memory spaces.
Then the port of the two stages of the CompCert \clight{} semantics is
described in Section~\ref{sec:port}, followed by the new executable semantics
in Section~\ref{sec:exec}. Finally we discuss how the semantics has
been validated
in Section~\ref{sec:valid}.
\section{Language extensions for the 8051 memory model}
\label{sec:ext}
The choice of an extended 8051 target for the \cerco{} compiler imposes an
irregular memory model with tight resource constraints. The different memory
spaces and access modes are summarised in Figure~\ref{fig:memorymodel} ---
essentially the evolution of the 8051 family has fragmented memory into
four regions: one half of the `internal' memory is fully accessible but also
contains the register banks, the second half cannot be accessed by direct addressing
because it is shadowed by the `Special Function Registers' (SFRs) for I/O;
`external memory' provides the bulk of memory in a separate address space; and
the code is in its own read-only space.
\begin{figure}
\setlength{\unitlength}{1pt}
\begin{picture}(410,250)(-50,200)
%\put(-50,200){\framebox(410,250){}}
\put(12,410){\makebox(80,0)[b]{Internal (256B)}}
\put(13,242){\line(0,1){165}}
\put(93,242){\line(0,1){165}}
\put(13,407){\line(1,0){80}}
\put(12,400){\makebox(0,0)[r]{0h}} \put(14,400){\makebox(0,0)[l]{Register bank 0}}
\put(13,393){\line(1,0){80}}
\put(12,386){\makebox(0,0)[r]{8h}} \put(14,386){\makebox(0,0)[l]{Register bank 1}}
\put(13,379){\line(1,0){80}}
\put(12,372){\makebox(0,0)[r]{10h}} \put(14,372){\makebox(0,0)[l]{Register bank 2}}
\put(13,365){\line(1,0){80}}
\put(12,358){\makebox(0,0)[r]{18h}} \put(14,358){\makebox(0,0)[l]{Register bank 3}}
\put(13,351){\line(1,0){80}}
\put(12,344){\makebox(0,0)[r]{20h}} \put(14,344){\makebox(0,0)[l]{Bit addressable}}
\put(13,323){\line(1,0){80}}
\put(12,316){\makebox(0,0)[r]{30h}}
\put(14,309){\makebox(0,0)[l]{\quad \vdots}}
\put(13,291){\line(1,0){80}}
\put(12,284){\makebox(0,0)[r]{80h}}
\put(14,263){\makebox(0,0)[l]{\quad \vdots}}
\put(12,249){\makebox(0,0)[r]{ffh}}
\put(13,242){\line(1,0){80}}
\qbezier(-2,407)(-6,407)(-6,393)
\qbezier(-6,393)(-6,324)(-10,324)
\put(-12,324){\makebox(0,0)[r]{Indirect/Stack}}
\qbezier(-6,256)(-6,324)(-10,324)
\qbezier(-2,242)(-6,242)(-6,256)
\qbezier(94,407)(98,407)(98,393)
\qbezier(98,393)(98,349)(102,349)
\put(104,349){\makebox(0,0)[l]{Direct}}
\qbezier(98,305)(98,349)(102,349)
\qbezier(94,291)(98,291)(98,305)
\put(102,242){\framebox(20,49){SFR}}
% bit access to sfrs?
\qbezier(124,291)(128,291)(128,277)
\qbezier(128,277)(128,266)(132,266)
\put(134,266){\makebox(0,0)[l]{Direct}}
\qbezier(128,257)(128,266)(132,266)
\qbezier(124,242)(128,242)(128,256)
\put(164,410){\makebox(80,0)[b]{External (64kB)}}
\put(164,220){\line(0,1){187}}
\put(164,407){\line(1,0){80}}
\put(244,220){\line(0,1){187}}
\put(164,242){\line(1,0){80}}
\put(163,400){\makebox(0,0)[r]{0h}}
\put(164,324){\makebox(80,0){Paged access}}
\put(164,310){\makebox(80,0){direct/indirect}}
\put(163,235){\makebox(0,0)[r]{80h}}
\put(164,228){\makebox(80,0){\vdots}}
\put(164,210){\makebox(80,0){Direct/indirect}}
\put(264,410){\makebox(80,0)[b]{Code (64kB)}}
\put(264,220){\line(0,1){187}}
\put(264,407){\line(1,0){80}}
\put(344,220){\line(0,1){187}}
\put(263,400){\makebox(0,0)[r]{0h}}
\put(264,228){\makebox(80,0){\vdots}}
\put(264,324){\makebox(80,0){Dsirect}}
\put(264,310){\makebox(80,0){PC relative}}
\end{picture}
\caption{The extended 8051 memory model}
\label{fig:memorymodel}
\end{figure}
To make efficient use of the limited amount of memory, compilers for 8051
microcontrollers provide extra keywords to allocate global variables to
particular memory spaces, and to limit pointers to address a particular space.
The freely available \sdcc{} compiler provides the following extensions for
the 8051 memory spaces:
\begin{table}[ht]
\begin{tabular}{rcl}
Attribute & Pointer size (bytes) & Memory space \\\hline
\lstinline'__data' & 1 & Internal, first half (0h -- 7fh) \\
\lstinline'__idata' & 1 & Internal, indirect only (80h -- ffh) \\
\lstinline'__pdata' & 1 & External, page access (usually 0h -- 7fh) \\
\lstinline'__xdata' & 2 & External, any (0h -- ffffh) \\
\lstinline'__code' & 2 & Code, any (0h -- ffffh) \\
\emph{none}& 3 & Any / Generic pointer
\end{tabular}
\end{table}\\
The generic pointers are a tagged union of the other kinds of pointers.
We intend the \cerco{} compiler to support extensions that are broadly
compatible with \sdcc{} to enable the compilation of programs with
either tool. In particular, this would allow the comparison of the
behaviour of test cases compiled with each compiler. Thus the C
syntax and semantics have been extended with the memory space
attributes listed above. The syntax follows \sdcc{} and in the
semantics we track the memory space that each block was allocated from
and only permit access via the appropriate kinds of pointers. The
details on these changes are given in the following sections.
The \sdcc{} compiler also supports special variable types for accessing the
SFRs, which provide the standard I/O mechanism for the 8051 family. (Note
that pointers to these types are not permitted because only direct addressing of
the SFRs is allowed.) We intend to use CompCert-style `external functions'
instead of special types. These are functions which are declared, but no C
implementation of them is provided. Instead they are provided by the runtime or
compiled directly to the corresponding machine code. This has the advantage
that no changes from the CompCert semantics are required, and a compatibility
library can be provided for \sdcc{} if necessary. The 8051 and the \sdcc{}
compiler also provide bit-level access to a small region of internal memory.
We do not intend to expose this feature to C programs in the \cerco{}
compiler, and so no extension is provided for it.
Finally, we have the option of using CompCert's translation of
\lstinline'volatile' variable accesses to `external' function calls. Should we
need more flexible I/O than SFRs provide, then we could adopt the \sdcc{}
extension to allocate a variable at a particular address to provide a way to
deal with memory mapped I/O in the external memory space. The translation to
function calls would mean that the semantics presented here would be
unaffected.
\section{Port of CompCert \clight{} semantics to \matita{}}
\label{sec:port}
\subsection{Parsing and elaboration}
The first stage taken from the CompCert semantics is the parsing and
elaboration of C programs into the simpler \clight{} language. This is based
upon the CIL library for parsing, analysing and transforming C programs by
Necula et.~al.~\cite{cil02}. The elaboration provides explicit type
information throughout the program, including extra casts for
promotion. It also
performs simplifications such as breaking up expressions with side effects
into effect-free expressions along with statements to perform the effects. The
transformed \clight{} programs are much more manageable and lack the ambiguities
of C, but also remain easily understood by C programmers.
The parser has been extended with the 8051 memory spaces attributes given
above. The resulting abstract syntax tree records them on global variable
declarations and pointer types. However, we also need to deal with them
during the elaboration process to produce all of the required type information.
For example, when the address-of operator \lstinline'&' is used it must decide
which kind of pointer should be used. Thus the extended elaboration process
keeps track of the memory space (if any) that the value of each
expression resides in. Where the memory space is not known, a generic pointer
will be used instead. Moreover, we also include the pointer kind when
determining whether a cast must be inserted so that conversions between
pointer representations can be performed.
Thus the elaboration turns the following C code
\begin{quote}
\begin{lstlisting}[language=C]
int g(int *x) { return 5; }
int f(__data int *x, int *y) {
return x==y ? g(x) : *x;
}
__data int i = 1;
int main(void) {
return f(&i, &i);
}
\end{lstlisting}
\end{quote}
into the Clight program below:
\begin{quote}
\begin{lstlisting}[language=C]
int g(int *x) { return 5; }
int f(__data int * x, int * y)
{
int t;
if (x == (__data int * )y) {
t = g((int * )x);
} else {
t = *x;
}
return t;
}
int main(void)
{
int t;
t = f(&i, (int * )(&i));
return t;
}
\end{lstlisting}
\end{quote}
The expression in \lstinline'f' had to be broken up due
to the call to \lstinline'g', and casts have been added to change between generic pointers
and pointers specific to the \lstinline'__data' section of memory. The
underlying data structure also has types attached to every expression, but
these are impractical to show in source form.
Note that the translation from C to \clight{} is not proven correct
--- instead it effectively forms a semi-formal part of the whole C
semantics. We can have some confidence in the code, however, because
it has received testing in the \cerco{} prototype, and it is very
close to the version used in CompCert. We can also perform testing of
the semantics without involving the rest of the compiler because we
have an executable semantics. Moreover, the cautious programmer could
choose to inspect the generated \clight{} code, or even work entirely
in the \clight{} language.
\subsection{Small-step inductive semantics}
\label{sec:inductive}
The semantics for \clight{} itself has been ported from the Coq development
used in CompCert to \matita{} for use in \cerco{}. Details about the original
big-step formalisation of \clight{} can be found in Leroy and
Blazy~\cite{compcert-mm08} (including a discussion of the translation from C
in \S 4.1), although we started from a later version with a
small-step semantics and hence support for \lstinline'goto' statements.
Several parts of the semantics were shared with other parts of the CompCert
development, notably:
\begin{itemize}
\item the representation of primitive values (integers, pointers and undefined
values, but not structures or unions) and operations on them,
\item traces of I/O events,
\item a memory model that keeps conceptually distinct sections of memory
strictly separate (assigning `undefined behaviour' to a buffer overflow, for
instance),
\item results about composing execution steps of arbitrary small-step
semantics,
\item data structures for local and global environments, and
\item common error handling constructs, in particular an error monad.
\end{itemize}
We anticipate a similar arrangement for the \cerco{} verified
compiler, although this means that there may be further changes to the
common parts of the semantics later in the project to harmonise the
stages of the compiler. In particular, some of data structures for
environments are just preliminary definitions for developing the
semantics.
The main body of the small-step semantics is a number of inductive definitions
giving details of the defined behaviour for casts, expressions and statements.
Expressions are side-effect free in \clight{} and only produce a value as
output. In our case we also need a trace of any cost labels that are
`evaluated' so that we will be able to give fine-grained costs for
the execution of compiled conditional expressions.
As an example of one of the expression rules, consider an expression which
evaluates a variable, \lstinline'Expr (Evar id) ty'. A variable is an example
of a class of expressions called \emph{lvalues}, which are roughly those
expressions which can be assigned to. Thus we use a general rule for lvalues,
\label{page:evalvar}
\begin{quote}
\begin{lstlisting}
ninductive eval_expr (ge:genv) (e:env) (m:mem) : expr $\rightarrow$ val $\rightarrow$ trace $\rightarrow$ Prop :=
...
| eval_Elvalue: $\forall$a,ty,psp,loc,ofs,v,tr.
eval_lvalue ge e m (Expr a ty) psp loc ofs tr $\rightarrow$
load_value_of_type ty m psp loc ofs = Some ? v $\rightarrow$
eval_expr ge e m (Expr a ty) v tr
\end{lstlisting}
\end{quote}
where the auxiliary relation \lstinline'eval_lvalue' yields the location of
the value, \lstinline'psp,loc,ofs', consisting of memory space, memory block,
and offset into the block, respectively. The expression can thus evaluate to
the value \lstinline'v' if \lstinline'v' can be loaded from that location.
One corresponding part of the \lstinline'eval_lvalue' definition is
\begin{quote}
\begin{lstlisting}
...
with eval_lvalue (*(ge:genv) (e:env) (m:mem)*) :
expr $\rightarrow$ memory_space $\rightarrow$ block $\rightarrow$ int $\rightarrow$ trace $\rightarrow$ Prop ≝
| eval_Evar_local: $\forall$id,l,ty.
get ??? id e = Some ? l $\rightarrow$
eval_lvalue ge e m (Expr (Evar id) ty) Any l zero E0
...
\end{lstlisting}
\end{quote}
simply looks up the variable in the local environment. The offset is
zero because all variables are given their own memory block to prevent
the use of stray pointers. A similar rule handles global variables,
with an extra check to ensure that no local variable has the same
name. Note that the two relations are defined using mutual recursion
because \lstinline'eval_lvalue' uses \lstinline'eval_expr' for the
evaluation of the pointer expression in the dereferencing rule.
Casts also have an auxiliary relation to specify the allowed changes, and
operations on values (including the changes in representation performed by
casting) are given as functions.
The only new expression in our semantics is the cost label which wraps
around another expression. It does not change the result, but merely
augments the trace with the given label to identify the branches taken
in conditional expressions so that accurate cost information can be
attached to the program:
\begin{quote}
\begin{lstlisting}
| eval_Ecost: $\forall$a,ty,v,l,tr.
eval_expr ge e m a v tr $\rightarrow$
eval_expr ge e m (Expr (Ecost l a) ty) v (tr++Echarge l)
\end{lstlisting}
\end{quote}
As the expressions are side-effect free, all of the changes to the state are
performed by statements. The state itself is represented by records of the
form
\begin{quote}
\begin{lstlisting}
ninductive state: Type :=
| State:
$\forall$f: function.
$\forall$s: statement.
$\forall$k: cont.
$\forall$e: env.
$\forall$m: mem. state
| Callstate:
$\forall$fd: fundef.
$\forall$args: list val.
$\forall$k: cont.
$\forall$m: mem. state
| Returnstate:
$\forall$res: val.
$\forall$k: cont.
$\forall$m: mem. state.
\end{lstlisting}
\end{quote}
During normal execution the state contains the currently executing
function's definition (used to find \lstinline'goto' labels and also
to check whether the function is expected to return a value), the
statement to be executed next, a continuation value to be executed
afterwards (where successor statements and details of function calls
and loops are stored), the local environment mapping variables to
memory locations\footnote{In the semantics all variables are
allocated, although the compiler may subsequently allocate them to
registers where possible.} and the current memory state. The
function call and return states appear to store less information
because the details of the caller are contained in the continuation.
An example of the statement execution rules is the assignment rule
(corresponding to the C syntax \lstinline'a1 = a2'),
\label{page:Sassign}
\begin{quote}
\begin{lstlisting}
ninductive step (ge:genv) : state $\rightarrow$ trace $\rightarrow$ state $\rightarrow$ Prop ≝
| step_assign: $\forall$f,a1,a2,k,e,m,psp,loc,ofs,v2,m',tr1,tr2.
eval_lvalue ge e m a1 psp loc ofs tr1 $\rightarrow$
eval_expr ge e m a2 v2 tr2 $\rightarrow$
store_value_of_type (typeof a1) m psp loc ofs v2 = Some ? m' $\rightarrow$
step ge (State f (Sassign a1 a2) k e m)
(tr1++tr2) (State f Sskip k e m')
...
\end{lstlisting}
\end{quote}
which can be read as:
\begin{itemize}
\item if \lstinline'a1' can evaluate to the location \lstinline'psp,loc,ofs',
\item \lstinline'a2' can evaluate to a value \lstinline'v2', and
\item storing \lstinline'v2' at location \lstinline'psp,loc,ofs' succeeds,
yielding the new memory state \lstinline-m'-, then
\item the program can step from the state about to execute
\lstinline'Sassign a1 a2' to a state with the updated memory \lstinline-m'-
about to execute the no-op \lstinline'Sskip'.
\end{itemize}
This rule would be followed by one of the rules to which replaces the
\lstinline'Sskip' statement with the `real' next statement constructed
from the continuation \lstinline'k'. Note that the only true
side-effect here is the change in memory --- the local environment is
initialised once and for all on function entry, and the only events
appearing in the trace are cost labels used purely for accounting. At
present this imposes an ordering due to the cost labels. Should this
prove too restrictive we may change it to produce a set of labels
encountered.
The \clight{} language provides input and output effects through `external'
functions and the step rule
\begin{quote}
\begin{lstlisting}
| step_external_function: $\forall$id,targs,tres,vargs,k,m,vres,t.
event_match (external_function id targs tres) vargs t vres $\rightarrow$
step ge (Callstate (External id targs tres) vargs k m)
t (Returnstate vres k m)
\end{lstlisting}
\end{quote}
which allows the function to be invoked with and return any values subject to
the enforcement of the typing rules in \lstinline'event_match', which also
provides the trace.
Cost label statements prefix the trace with the given label, similar to the cost
label expressions above.
\section{Executable semantics}
\label{sec:exec}
We have added an equivalent functional version of the \clight{} semantics that
can be used to animate programs. The definitions roughly follow the inductive
semantics, but are necessarily rearranged around pattern matching of the
relevant parts of the state rather than presenting each case separately.
\subsection{Expressions}
The code corresponding to the variable lookup definitions on
page~\pageref{page:evalvar} is
\begin{quote}
\begin{lstlisting}
nlet rec exec_expr (ge:genv) (en:env) (m:mem) (e:expr) on e : res (val$\times$trace) ≝
match e with
[ Expr e' ty $\Rightarrow$
match e' with
[ ...
| Evar _ $\Rightarrow$
do $\langle$l,tr$\rangle$ $\leftarrow$ exec_lvalue' ge en m e' ty;
do v $\leftarrow$ opt_to_res ? (load_value_of_type' ty m l);
OK ? $\langle$v,tr$\rangle$
...
]
]
and exec_lvalue' (ge:genv) (en:env) (m:mem) (e':expr_descr) (ty:type) on e'
: res (memory_space $\times$ block $\times$ int $\times$ trace) ≝
match e' with
[ Evar id $\Rightarrow$
match (get … id en) with
[ None $\Rightarrow$ do $\langle$sp,l$\rangle$ $\leftarrow$ opt_to_res ? (find_symbol ? ? ge id);
OK ? $\langle$$\langle$$\langle$sp,l$\rangle$,zero$\rangle$,E0$\rangle$ (* global *)
| Some loc $\Rightarrow$ OK ? $\langle$$\langle$$\langle$Any,loc$\rangle$,zero$\rangle$,E0$\rangle$ (* local *)
]
...
\end{lstlisting}
\end{quote}
where the result is placed in an error monad (the \lstinline'res' type
constructor) so that \emph{undefined behaviour} such as dereferencing an
invalid pointer can be rejected. We use \lstinline'do' notation similar to
Haskell and CompCert, where
\begin{quote}
\begin{lstlisting}
do x $\leftarrow$ e; e'
\end{lstlisting}
\end{quote}
means evaluate \lstinline'e' and if it yields a value then bind that to
\lstinline'x' and evaluate \lstinline-e'-, and otherwise propogate the error.
\subsection{Statements}
Evaluating a step of a statement is complicated by the presence of the
`external' functions for I/O, which can return arbitrary values. These are
handled by a resumption monad, which on encountering some I/O returns a
suspension. When the suspension is applied to a value the evaluation of the
semantics is resumed. Resumption monads are a standard tool for providing
denotational semantics for input~\cite{Moggi199155} and interleaved
concurrency~\cite[Chapter 12]{schmidt86}.
The definition also incorporates errors, and uses a coercion to automatically
transform values from the plain error monad.
The definition of the monad is:
\begin{quote}
\begin{lstlisting}
ninductive IO (output:Type) (input:output $\rightarrow$ Type) (T:Type) : Type :=
| Interact : $\forall$o:output. (input o $\rightarrow$ IO output input T) $\rightarrow$ IO output input T
| Value : T $\rightarrow$ IO output input T
| Wrong : IO output input T.
nlet rec bindIO (O:Type) (I:O $\rightarrow$ Type) (T,T':Type)
(v:IO O I T) (f:T $\rightarrow$ IO O I T') on v : IO O I T' :=
match v with
[ Interact out k $\Rightarrow$ (Interact ??? out ($\lambda$res. bindIO O I T T' (k res) f))
| Value v' $\Rightarrow$ (f v')
| Wrong $\Rightarrow$ Wrong O I T'
].
\end{lstlisting}
\end{quote}
Note that the type of the input value is dependent on the output value.
This enables us to ensure that the input is always well-typed. An
alternative approach is a check in the semantics, but this causes
programs to fail in a way that has no counterpart in the inductive
semantics.
The execution of assignments is straightforward,
\begin{quote}
\begin{lstlisting}
nlet rec exec_step (ge:genv) (st:state) on st : (IO io_out io_in (trace $\times$ state)) ≝
match st with
[ State f s k e m $\Rightarrow$
match s with
[ Sassign a1 a2 $\Rightarrow$
! $\langle$l,tr1$\rangle$ $\leftarrow$ exec_lvalue ge e m a1;
! $\langle$v2,tr2$\rangle$ $\leftarrow$ exec_expr ge e m a2;
! m' $\leftarrow$ store_value_of_type' (typeof a1) m l v2;
ret ? $\langle$tr1++tr2, State f Sskip k e m'$\rangle$
...
\end{lstlisting}
\end{quote}
where \lstinline'!' is used in place of \lstinline'do' due to the change in
monad. The content is essentially the same as the inductive rule given on
page~\pageref{page:Sassign}.
Most other rules are similar translations of the inductive semantics.
The handling of external calls uses the
\begin{quote}
\begin{lstlisting}
do_io : ident $\rightarrow$ list eventval $\rightarrow$ IO eventval io_out eventval
\end{lstlisting}
\end{quote}
function to suspend execution:
\begin{quote}
\begin{lstlisting}
...
| Callstate f0 vargs k m $\Rightarrow$
match f0 with
[ ...
| External f argtys retty $\Rightarrow$
! evargs $\leftarrow$ err_to_io_sig … (check_eventval_list vargs (typlist_of_typelist argtys));
! evres $\leftarrow$ do_io f evargs (proj_sig_res (signature_of_type argtys retty));
ret ? $\langle$Eextcall f evargs (mk_eventval ? evres), Returnstate (mk_val ? evres) k m$\rangle$
]
...
\end{lstlisting}
\end{quote}
The rest of the code after \lstinline'do_io' is included in the
suspension returned.
Together with functions to provide the initial state for a program and to
detect a final state we can write a function to run the program up to a given
number of steps. Similarly, a corecursive function can return the entire
execution as a stream of trace and state pairs.
\section{Validation}
\label{sec:valid}
We have used two methods to validate our executable semantics: we have
proven them equivalent to the inductive semantics of
Section~\ref{sec:inductive}, and we have animated small examples of
key areas.
\subsection{Equivalence to inductive semantics}
To show that the executable semantics are sound with respect to the
inductive semantics we need to prove that any value produced by each
function satisfies the corresponding relation, modulo errors and
resumption. To deal with these monads we lift the properties
required. In particular, for the resumption monad we ignore error
values, require the property when a value is produced, and quantify
over any interaction with the outside world:
\begin{quote}
\begin{lstlisting}
nlet rec P_io O I (A:Type) (P:A $\rightarrow$ Prop) (v:IO O I A) on v : Prop :=
match v return $\lambda$_.Prop with
[ Wrong $\Rightarrow$ True
| Value z $\Rightarrow$ P z
| Interact out k $\Rightarrow$ $\forall$v'.P_io O I A P (k v')
].
\end{lstlisting}
\end{quote}
We can use this lifting with the relations from the inductive
semantics to state soundness properties:
\begin{quote}
\begin{lstlisting}
ntheorem exec_step_sound: $\forall$ge,st.
P_io ??? ($\lambda$r. step ge st (\fst r) (\snd r)) (exec_step ge st).
\end{lstlisting}
\end{quote}
The proofs of these theorems use case analysis over the state, a few
lemmas to break up the expressions in the monad and the other
soundness results to form the corresponding derivation in the
inductive semantics.
We experimented with a different way of specifying soundness using
dependent types:
\begin{quote}
\begin{lstlisting}
nlet rec exec_step (ge:genv) (st:state) on st
: (IO eventval io_out ($\Sigma$r:trace $\times$ state. step ge st (\fst r) (\snd r))) ≝
\end{lstlisting}
\end{quote}
Note the $\Sigma$ type for the result of the function, which shows that
successful executions are sound with respect to the inductive semantics.
Matita automatically generates proof obligations for each case due to a
coercion between the types
\begin{center}
\lstinline'option (res T)' \quad and \quad \lstinline'res ($\Sigma$x:T. P x)'
\end{center}
(where a branch marked \lstinline'None' would generate a proof obligation to
show that it is impossible, although the semantics do not use this feature).
This is intended to mimic Sozeau's \textsc{Russell} language and elaboration
into Coq~\cite{sozeau06}. The coercion also triggers an automatic mechanism
in \matita{} to add equalities for each pattern matched. The proofs
are essentially the same as before.
However, the soundness proofs then pervade the executable semantics,
making rewriting in the correctness proofs more difficult. We decided
to keep the soundness results separate, partly because of the
increased difficulty of using the resulting terms in proofs, and
partly because they are of little consequence once equivalence has
been shown.
The completeness results requiring a dual lifting which requires the
term to reduce to a particular value, allowing for resumptions with
existential quantification:
\begin{quote}
\begin{lstlisting}
nlet rec yieldsIO (A:Type) (a:IO io_out io_in A) (v':A) on a : Prop :=
match a with
[ Value v $\Rightarrow$ v' = v
| Interact _ k $\Rightarrow$ $\exists$r.yieldsIO A (k r) v'
| _ $\Rightarrow$ False
].
\end{lstlisting}
\end{quote}
We then show the completeness theorems, such as
\begin{quote}
\begin{lstlisting}
ntheorem step_complete: $\forall$ge,s,tr,s'.
step ge s tr s' $\rightarrow$ yieldsIO ? (exec_step ge s) $\langle$tr,s'$\rangle$.
\end{lstlisting}
\end{quote}
by case analysis on the inductive derivation and a mixture of
reduction and rewriting. Thus we know that executing a step in these
semantics is equivalent to a step in the inductive semantics.
Showing the equivalence of whole program execution is a little
trickier. Our executable semantics produces a coinductive
\lstinline'execution' which is really a tree of executions, branching
at each I/O resumption on the input value:
\[
\begin{array}{rcl@{\;}l}
& & \mathsf{e\_step}\ t_i\ s_i \rightarrow &\cdots\ \mathsf{e\_stop}\ t_j\ i\ m \\
& \nearrow & \quad\quad \vdots \\
\mathsf{e\_step}\ \mathsf{E0}\ s_0 \rightarrow \mathsf{e\_step}\ t_1\ s_1 \rightarrow \cdots \mathsf{e\_interact}\ o_1\ k_1 & \rightarrow & \mathsf{e\_step}\ t_i'\ s_i' \rightarrow &\cdots\ \mathsf{e\_step}\ t_j\ s_j \rightarrow \cdots \\
& \searrow & \quad\quad\vdots \\
&& \mathsf{e\_step}\ t_i''\ s_i'' \rightarrow &\cdots\ \mathsf{e\_wrong}
\end{array}
\]
Each \textsf{e\_step} comes with the trace (often the empty trace,
\textsf{E0}) and current state. Each branch corresponds to calling
the continuation $k_1$ with a different input value. We use the
\lstinline'single_exec_of' predicate to identify single executions from
these trees, essentially fixing a stream of input values.
However, the inductive semantics divides program behaviours into four
categories which have \emph{individual} (co)inductive descriptions:
\begin{itemize}
\item successfully terminating executions;
\item programs which eventually diverge (with an empty trace);
\item programs which keep interacting in some way (with an infinite
trace)\footnote{In our setting this includes passing through cost
labels as well as I/O.}; and
\item programs which \emph{go wrong}.
\end{itemize}
We cannot constructively decide which of these categories an execution
can fit into because the properties they describe are undecidable.
Hence we follow CompCert's approach for showing that one of the
behaviours always exists using classical logic. Thus we characterise
the executions, then show the existence of the inductive semantics'
behaviour that matches. We limit the scope of classical reasoning by
taking the relevant axioms as hypotheses:
\begin{quote}
\begin{lstlisting}
ntheorem exec_inf_equivalence:
$\forall$classic:($\forall$P:Prop.P $\vee$ $\neg$P).
$\forall$constructive_indefinite_description:($\forall$A:Type. $\forall$P:A$\rightarrow$Prop. ($\exists$x. P x) $\rightarrow$ $\Sigma$x:A. P x).
$\forall$p,e. single_exec_of (exec_inf p) e $\rightarrow$
$\exists$b.execution_matches_behavior e b $\wedge$ exec_program p b.
\end{lstlisting}
\end{quote}
\subsection{Animation of simple C programs}
We are currently working with a development version of \matita{} which
(temporarily) does not support extraction to OCaml code. Hence to
animate a program we first parse it with CIL and produce a \matita{}
term in text format representing the program, then interpret it within
\matita{}.
This process is rather laborious, so we have concentrated on testing
small programs which exercised areas of the semantics which depart
from CompCert. In particular, we tested several aspects of the
handling of memory spaces and the casting of pointers. Together with
the results in the previous sections we gain considerable confidence
that the semantics describe the behaviour of programs properly.
Nevertheless, we intend to experiment with larger C programs once
extraction is available.
To give a concrete example, the following C program reads an integer
using an `external' function and returns its factorial as the
program's exit value:
\begin{lstlisting}[language=C]
int get_input(void);
int main(void) {
int i = get_input();
int r = 1;
int j;
for (j = 2; j<=i; j++)
r = r * j;
return r;
}
\end{lstlisting}
The Clight code is essentially the same, and the \matita{} term is:
\begin{lstlisting}
ndefinition myprog := mk_program fundef type
[mk_pair ?? (succ_pos_of_nat 132 (* get_input *))
(External (succ_pos_of_nat 132) Tnil (Tint I32 Signed));
mk_pair ?? (succ_pos_of_nat 133 (* main *)) (Internal (
mk_function (Tint I32 Signed ) [] [mk_pair ?? (succ_pos_of_nat 134) (Tint I32 Signed);
mk_pair ?? (succ_pos_of_nat 135) (Tint I32 Signed);
mk_pair ?? (succ_pos_of_nat 136) (Tint I32 Signed)]
(Ssequence
(Scall (Some ? (Expr (Evar (succ_pos_of_nat 134)) (Tint I32 Signed)))
(Expr (Evar (succ_pos_of_nat 132))
(Tfunction Tnil (Tint I32 Signed)))
[])
(Ssequence
(Sassign (Expr (Evar (succ_pos_of_nat 135)) (Tint I32 Signed))
(Expr (Econst_int (repr 1)) (Tint I32 Signed)))
(Ssequence
(Sfor (Sassign (Expr (Evar (succ_pos_of_nat 136)) (Tint I32 Signed))
(Expr (Econst_int (repr 2)) (Tint I32 Signed)))
(Expr (Ebinop Ole
(Expr (Evar (succ_pos_of_nat 136)) (Tint I32 Signed))
(Expr (Evar (succ_pos_of_nat 134)) (Tint I32 Signed)))
(Tint I32 Signed ))
(Sassign (Expr (Evar (succ_pos_of_nat 136)) (Tint I32 Signed))
(Expr (Ebinop Oadd
(Expr (Evar (succ_pos_of_nat 136)) (Tint I32 Signed))
(Expr (Econst_int (repr 1)) (Tint I32 Signed)))
(Tint I32 Signed)))
(Sassign (Expr (Evar (succ_pos_of_nat 135)) (Tint I32 Signed))
(Expr (Ebinop Omul
(Expr (Evar (succ_pos_of_nat 135)) (Tint I32 Signed))
(Expr (Evar (succ_pos_of_nat 136)) (Tint I32 Signed)))
(Tint I32 Signed)))
)
(Sreturn (Some ? (Expr (Evar (succ_pos_of_nat 135))
(Tint I32 Signed)))))))
))]
(succ_pos_of_nat 133)
[].
\end{lstlisting}
We can use the definitions in the \texttt{Animation.ma} file to reduce
the term for a given input (5, in this case; executing a maximum of 40 steps):
\begin{lstlisting}
nremark exec: result ? (exec_up_to myprog 40 [EVint (repr 5)]).
nnormalize; (* you can examine the result here *)
@; nqed.
\end{lstlisting}
The result state records the interaction (the \lstinline'EVextcall')
and the return result (120 in least-significant-bit first binary):
\begin{lstlisting}
result (res (list event$\times$state))
(OK (list event$\times$state)
$\langle$[(EVextcall (p1 (p0 (p1 (p0 (p0 (p0 (p0 one))))))) []
(EVint (mk_int (pos (p1 (p0 one))) (inrg_mod (pos (p1 (p0 one)))))))],
Returnstate (Vint (mk_int (pos (p0 (p0 (p0 (p1 (p1 (p1 one)))))))
(inrg_mod (pos (p0 (p0 (p0 (p1 (p1 (p1 one))))))))))
Kstop MEM $\rangle$)
\end{lstlisting}
\appendix
\section{Description of the Code}
The files ported from CompCert were based on version 1.6, with some
minor details taken from 1.7.1 (in particular, the parser and the
method of building infinite traces for the equivalence proof).
The majority of the semantics is given as \matita{} source files. The
exception is the changes to the CIL based parser, which is presented
as a patch to a preliminary version of the \cerco{} prototype
compiler. This patch provides both the extensions to the parser and a
pretty printer to produce a usable \matita{} term representing the
program.
\begin{quote}
\begin{tabular}{p{9em}l}
acc-0.1.spaces.patch & Changes to early prototype compiler for parsing
\end{tabular}
\end{quote}
\subsubsection*{Ancilliary definitions}
Files corresponding to CompCert.
\begin{quote}
\begin{tabular}{p{9em}l}
Coqlib.ma & Minor definitions ported from CompCert \\
Errors.ma & The error monad \\
Floats.ma & Axiomatised floating point numbers \\
Globalenvs.ma & Global environments \\
Integers.ma & Integers modulo powers of two \\
Maps.ma & Finite maps (used in particular for local environments) \\
Smallstep.ma & Generic definitions and lemmas for small step semantics \\
\end{tabular}
\end{quote}
\noindent
Files specific to this development.
\begin{quote}
\begin{tabular}{p{9em}l}
binary/positive.ma & Binary positive numbers \\
binary/Z.ma & Binary integers \\
extralib.ma & Extensions to \matita{}'s library \\
\end{tabular}
\end{quote}
\subsubsection*{Inductive semantics ported from CompCert}
\begin{quote}
\begin{tabular}{p{9em}l}
AST.ma & Minor syntax definitions intended for several compiler stages \\
Values.ma & Definitions for values manipulated by Clight programs \\
Mem.ma & Definition of the memory model \\
Events.ma & I/O events \\
CostLabel.ma & Definition of cost labels \\
Csyntax.ma & Clight syntax trees \\
Csem.ma & Clight inductive semantics \\
\end{tabular}
\end{quote}
\subsubsection*{Executable semantics}
\begin{quote}
\begin{tabular}{p{9em}l}
IOMonad.ma & Definitions of I/O resumption monad \\
Cexec.ma & Definition of the executable semantics \\
CexecSound.ma & Soundness of individual steps \\
CexecComplete.ma & Completeness of individual steps \\
CexecEquiv.ma & Equivalence of whole program executions \\
Animation.ma & Definitions to help test the semantics
\end{tabular}
\end{quote}
\bibliographystyle{plain}
\bibliography{report}
\end{document}