source: Papers/cpp-exec-2012/ccexec.tex

Last change on this file was 2382, checked in by campbell, 7 years ago

Final version of executable semantics paper.

File size: 43.0 KB
Line 
1\documentclass{llncs}
2\usepackage{hyperref}
3\usepackage{graphicx}
4\usepackage{color}
5\usepackage{listings}
6
7% NB: might be worth removing this if changing class in favour of
8% a built-in definition.
9%\newtheorem{theorem}{Theorem}
10
11\lstdefinelanguage{coq}
12  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
13   morekeywords={[2]if,then,else,forall,Prop,match,with,end,let},
14  }
15
16\lstdefinelanguage[mine]{C}[ANSI]{C}{
17  morekeywords={int8_t}
18}
19
20\lstset{language=[mine]C,basicstyle=\footnotesize\tt,columns=flexible,breaklines=false,
21        keywordstyle=\color{red}\bfseries,
22        keywordstyle=[2]\color{blue},
23        commentstyle=\color{green},
24        stringstyle=\color{blue},
25        showspaces=false,showstringspaces=false,
26        xleftmargin=1em}
27
28\begin{document}
29
30\title{An Executable Semantics for CompCert C\thanks{The project CerCo acknowledges the financial support of the Future and
31Emerging Technologies (FET) programme within the Seventh Framework
32Programme for Research of the European Commission, under FET-Open grant
33number: 243881}}
34\author{Brian Campbell}
35\institute{LFCS, University of Edinburgh,\\\email{Brian.Campbell@ed.ac.uk}}
36\maketitle
37\begin{abstract}
38CompCert is a C compiler developed by Leroy et al, the
39  majority of which is formalised and verified in the Coq proof
40  assistant.  The correctness theorem is defined in terms of a
41  semantics for the `CompCert C' language, but how can we gain faith
42  in those semantics?  We explore one approach: building an equivalent
43  executable semantics that we can check test suites of code
44  against.
45\end{abstract}
46
47Flaws in a compiler are often reflected in the output they produce:
48buggy compilers produce buggy code.  Moreover, bugs can evade testing
49when they only manifest themselves on source code of a particular shape,
50when particular optimisations are used.  This has made compilers an
51appealing target for mechanized verification, from early work such as
52Milner and Weyhrauch's simple compiler in LCF~\cite{MilnerWeyhrauch72}
53to modern work on practical compilers, such as the Verisoft C0
54compiler~\cite{Leinenbach200823} which is used as a key component of a
55larger verified software system.
56
57% TODO: does the NICTA work do anything like this?
58
59The CompCert project~\cite{Leroy:2009:FVR:1538788.1538814} has become
60a nexus of activity in recent years, including the project members'
61own efforts to refine the compiler and add certified
62optimisations~\cite{1328444,springerlink:10.1007/978-3-642-11970-5_13,springerlink:10.1007/978-3-642-28869-2_20},
63and external projects such as an extensible optimisation
64framework~\cite{Tatlock:2010:BEV:1806596.1806611} and compiling
65concurrent programs~\cite{Sevcik:2011:RCV:1926385.1926393}.  The main
66result of the project is a C compiler which targets the assembly
67languages for a number of popular processor architectures.  Most of
68the compiler is formalised in the Coq proof
69assistant % TODO: consider citation?
70and accompanied by correctness results, most notably proving that any
71behaviour given to the assembly output must be an acceptable behaviour
72for the C source program.
73
74To state these correctness properties requires giving some semantics to
75the source and target languages.  In CompCert these are given by
76inductively defined small-step relations.  Errors and omissions in
77these semantics can weaken the overall theorems, potentially masking
78bugs in the compiler.  In particular, if no semantics is given to a
79legitimate C program then any behaviour is acceptable for the
80generated assembly, and the compiler is free to generate bad
81code\footnote{By design, the compiler is also free to fail with an
82  error for \emph{any} program, even if it is well-defined.}.  This
83corresponds to the C standard's notion of \emph{undefined behaviour},
84but this notion is left implicit in the semantics, so there is a
85danger of underdefining C.
86
87% TODO: clight paper also mentions executable semantics, in some detail
88Several methods could be used to gain faith in the semantics; in the
89conclusions of~\cite{Leroy:2009:FVR:1538788.1538814} Leroy suggests
90manual review, testing of executable versions and proving connections
91to alternative forms of semantics.  Here we investigate testing an
92executable semantics which, by construction, is closely and formally
93related to CompCert's input language.  The testing is facilitated by
94Coq's extraction mechanism, which produces OCaml code corresponding to
95the executable semantics that can be integrated with CompCert's
96existing parser to produce a C interpreter.
97
98Our motivation for this work comes principally from the CerCo project
99where we use executable semantics throughout~\cite{Amadio2011175}, and in particular have
100an executable version of CompCert's Clight intermediate
101language~\cite{EDI-INF-RR-1412}.  There we found an omission involving function
102pointers during testing.  We wished to attempt to replicate this in
103CompCert C and search for any other evident problems.  An executable
104semantics is far more practical for this task than manual animation of
105the inductive definitions\footnote{This is partly because manual
106  animation requires providing witnesses for the results or careful
107  management of existential metavariables, and partly because we can
108  prove results once for the executable semantics rather than once per
109  program.  Automatic proof search has similar problems.}.
110Morever, CompCert C is a more complex and interesting language than
111Clight, especially due to the presence of side-effects and
112non-determinism in the evaluation of expressions.
113
114The main contributions of this paper are
115\begin{enumerate}
116\item demonstrating the successful retrofitting of a small-step
117  executable semantics to an existing verified compiler with
118  equivalence proofs against the original semantics,
119\item showing that testing of the resulting interpreter does lead to
120  the identification of bugs in both the semantics and the compiler,
121\item showing that the executable semantics can illustrate the
122  limitations of the semantics, and fill in a gap in the relationship
123  between the original deterministic and non-deterministic versions of
124  the semantics, and
125\item demonstrate that a mixture of formal Coq and informal OCaml code can make
126  testing more effective by working around known bugs with little
127  effort.
128\end{enumerate}
129% TODO: consider moving out of footnote to make it less footnote heavy
130The full development is available online as a modified version of CompCert\footnote{\url{http://homepages.inf.ed.ac.uk/bcampbe2/compcert/}}.
131
132In Section~\ref{sec:compcert} we will give an overview of the relevant
133parts of the CompCert compiler.  In Section~\ref{sec:construction} we
134will discuss the construction of the executable semantics, its
135relationship to the inductively defined semantics from CompCert (both
136formalised in Coq) and
137the additional OCaml code used to support testing.  This is followed by the
138results of testing the semantics in Section~\ref{sec:testing},
139including descriptions of the problems encountered.
140Finally, Section~\ref{sec:related} discusses related work, including
141an interpreter Leroy added to newer versions of CompCert following
142this work.
143
144\section{Overview of the CompCert compiler}
145\label{sec:compcert}
146
147CompCert compiles a simplified but substantial subset of the C
148programming language to one of three different architectures (as of
149version 1.8.2, which we refer to throughout unless stated otherwise).
150The main stages of the compiler, from an abstract syntax tree of the
151`CompCert C' language to target specific pseudo-assembly code, are
152written in the Calculus of Inductive Constructions (CIC) --- the
153dependently-typed language that underlies the Coq proof
154assistant~\cite{coq83}.  The formal proofs of correctness refer to
155these definitions.  Coq's extraction facilities~\cite{coqextr02}
156produce OCaml code corresponding to them, which can be
157combined with a C parser and assembler text generation to complete the
158compiler.  The resulting compilation chain is summarised in
159Figure~\ref{fig:chain}.
160\begin{figure}
161\begin{center}
162\includegraphics{cc-diagram.pdf}
163\end{center}
164\caption{Start and end of the CompCert compilation chain}
165\label{fig:chain}
166\end{figure}
167
168Note that the unformalised C parser is not a trivial piece of code.
169Originally based on the CIL library for C parsing~\cite{cil02} but
170heavily adapted, it not only produces an abstract syntax tree, but
171also calculates type information and can perform several
172transformations to provide partial support for features that are not
173present in the formalised compiler.  In particular, bitfields,
174structure passing and assignment, and \lstinline'long double' and
175\lstinline'long long' types are (optionally) handled here.  There is a
176further transformation to simplify the annotated C abstract syntax
177tree into the CompCert C language.
178
179Earlier versions of CompCert did not feature CompCert C at all, but the
180less complex Clight language.  Clight lives on as the next
181intermediate language in the compiler.  Hence several publications on
182CompCert refer to this
183language~\cite{springerlink:10.1007/s10817-009-9148-3}.  The major
184difference between the languages is that the syntax and semantics of
185expressions in Clight are much simpler because they are deterministic
186and side-effect free.  Moreover, CompCert's original semantics were in
187big-step form and lacked support for \texttt{goto} statements.
188Version 1.5 added these using a small-step semantics, and 1.8 added the
189CompCert C language with its C-like expressions.  The latter
190effectively moved some of the work done by the OCaml parser into the
191formalised compiler.
192% TODO: mention parser changes in 1.7?
193% TODO: In response to Csmith?
194
195CompCert C is also interesting because it has two semantics: one with
196non-deterministic behaviour for expressions which models the
197uncertainty about evaluation order in C, and a deterministic semantics
198which resolves this uncertainty.  The former can be regarded as the
199intended input language of the compiler, and the latter as a more
200specific version that the compiler actually implements.  In addition,
201these semantics have slightly different forms: in particular the
202non-deterministic version is always small-step, whereas the
203deterministic form makes a single big-step for side-effect free
204subexpressions.  The two are connected by a result showing that
205programs which are \emph{safe} in the non-deterministic semantics
206admit some execution in the deterministic semantics, where by
207\emph{safe} we mean that the program never gets stuck regardless of
208the order of evaluation used (for a fixed choice of I/O behaviour).
209This result essentially holds because the deterministic semantics is
210choosing one of the possible evaluation orders.
211
212Note that throughout the definitions of all of these semantics we
213benefit from a specialisation of C to the targets that CompCert
214supports.  Various \emph{implementation-defined} parts of the language
215such as integer representation and sizes are fixed (in fact, the
216formalisation heavily uses the assumption of a 32-bit word size).
217% which is the largest integer type supported, making long long a pain
218Also, some \emph{undefined behaviour} is given a meaning; notably
219mixtures of reads and writes to a variable in a single expression are
220given a non-deterministic behaviour rather than ruled out as per the C
221standard~\cite[\S 6.5.16]{c99}.
222
223\section{Construction of the executable semantics}
224\label{sec:construction}
225
226The executable semantics consists of several parts: existing
227executable definitions from CompCert, functions which closely
228correspond to the original relational semantics, a mechanism for
229resolving non-deterministic behaviour, and finally the proofs that
230steps of the relational semantics are equivalent to steps of the
231executable semantics.  We do not consider whole program executions or
232interaction with the outside world (I/O) here, but only individual
233steps of the semantics.  It is not difficult to include these (we have done
234this for Clight as part of the CerCo project~\cite{EDI-INF-RR-1412}), but we do
235not expect that it would yield any worthwhile insights during testing
236to justify the effort involved arising from the extra non-determinism
237from I/O and the coinductive reasoning for non-terminating programs.
238
239The existing definitions that we can reuse include the memory model
240that is used throughout the CompCert development~\cite{compcert-mm08}.
241This model features symbolic pointers with discrete abstract blocks of
242memory, and so is more suitable for our purposes than non-executable
243alternatives such as a concrete model with non-deterministic choice of
244fresh locations for allocations.  The semantics of operations on
245values (for example, \texttt{+} and \texttt{==}) are naturally defined
246as auxiliary functions in the relational semantics, which we reuse.
247Casts and the conversion of values to booleans are exceptions to this;
248they are defined inductively in the relational semantics and we treat
249them in the same way as the rest of the inductive definitions, below.
250
251Local and global variable environments are necessarily executable in
252CompCert because they are used throughout the actual compiler code in
253addition to the semantics.  We also make use of the compiler's error
254monad to recover the partiality of the relational semantics (that is,
255CIC is a language of total functions, so we must model failure
256explicitly).
257
258The environments are efficiently implemented because of their use in
259the compiler itself.  However, the memory model was not intended to be
260efficient and builds large chains of closures to represent changes to
261memory.  Fortunately, the performance was sufficient for our testing,
262so we did not attempt to optimise it.
263
264\subsection{Executing a step of the semantics}
265
266The relational semantics uses a series of inductive definitions to
267describe the steps of the abstract machine: one for reducing
268\emph{lvalues}, expressions which represent a location that can be
269read from, written to, or extracted as a pointer; one for
270\emph{rvalues} expressions that only yield a value, such as addition;
271and one for setting up function calls:
272\begin{lstlisting}[language=coq]
273Inductive lred: expr -> mem -> expr -> mem -> Prop := ...
274Inductive rred: expr -> mem -> expr -> mem -> Prop := ...
275Inductive callred: expr -> fundef -> list val -> type -> Prop := ...
276\end{lstlisting}
277The \lstinline'estep' relation represents whole state transitions
278corresponding to the reductions on expressions given by the first
279three relations, and the \lstinline'sstep' relation performs steps of
280statements and function entry and exit:
281\begin{lstlisting}[language=coq]
282Inductive estep: state -> trace -> state -> Prop := ...
283Inductive sstep: state -> trace -> state -> Prop := ...
284\end{lstlisting}
285The union of these two relations gives the overall \lstinline'step' relation.
286
287The executable semantics uses functions corresponding to each of these
288relations.  The syntax directed nature of the inductive definitions
289makes this task straightforward.  For example, consider the
290\lstinline'rred' rules for conditional expressions:
291\begin{lstlisting}[language=coq]
292  | red_condition_true: forall v1 ty1 r1 r2 ty m,
293      is_true v1 ty1 -> typeof r1 = ty ->
294      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
295           (Eparen r1 ty) m
296  | red_condition_false: forall v1 ty1 r1 r2 ty m,
297      is_false v1 ty1 -> typeof r2 = ty ->
298      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
299           (Eparen r2 ty) m
300\end{lstlisting}
301These state that a true conditional reduces to its left subexpression
302and a false one to its right, subject to a typing constraint.
303
304We rearrange the rules to fit in a tree of pattern matching, check
305type equalities where necessary, and introduce error messages where no
306reduction is possible:
307\begin{lstlisting}[language=coq]
308Definition exec_rred (e:expr) (m:mem) : res (expr * mem) :=
309match e with
310...
311| Econdition (Eval v1 ty1) r1 r2 ty =>
312    do b <- exec_bool_val v1 ty1;
313    let r := if b then r1 else r2 in
314    match type_eq (typeof r) ty with
315    | left _ => OK (Eparen r ty, m)
316    | right _ => Error (msg "type mismatch in Econdition")
317    end
318\end{lstlisting}
319This transforms the value \lstinline[language=coq]'v1' to a boolean,
320selects the appropriate subexpression and checks its type.  The
321`\lstinline[language=coq]'do'' notation is a use of the error monad:
322the conversion to a boolean is not always defined and may return an
323error, which the monad propogates to the caller of
324\lstinline[language=coq]'exec_rred'.  The auxiliary relations defining
325the relationship between values and booleans,
326\lstinline[language=coq]'exec_bool_val', and casting are also defined
327as functions.
328
329Defining a function for the statement reduction, \lstinline'sstep', is
330similar.  One part of function entry handling deserves particular
331attention.  CompCert has a notion of \emph{external functions}, which
332are a set of primitive CompCert C functions that are handled outside
333of the program to provide system calls, dynamic memory allocation,
334volatile memory accesses and annotations.  Inductive predicates
335describe what changes to the program's state are allowed.  We have
336only implemented the dynamic memory allocation (by constructing a
337function similar to the predicates for \lstinline'malloc' and
338\lstinline'free'), and provide a predicate on states to identify uses
339of the other calls so that we may give a partial completeness result
340in the next section.
341
342The expression relation \lstinline'estep' is more difficult.  This is
343the point in the semantics at which non-deterministic behaviour
344appears --- any subexpression in a suitable \emph{context} can be
345reduced.  The contexts are defined by an inductive description of
346which subexpressions can be evaluated in the current state (for
347example, only the guard of a conditional expression can be evaluated
348before the conditional expression itself, whereas either subexpression
349of $e_1$ \lstinline'+' $e_2$ can be reduced).  We could execute the
350non-deterministic semantics by collecting all of the possible
351reductions in a set (in the form of the possible successor states), but we are primarily interested in testing the
352semantics on well behaved C programs, and an existing theorem in
353CompCert shows that any program with a \emph{safe} non-deterministic
354behaviour also has a behaviour in the deterministic semantics.
355
356Hence we are satisfied with following a single path of execution, and
357so parametrise the executable semantics by a \emph{strategy} function
358which determines the evaluation order by picking a particular
359subexpression, and also returns its context.
360These contexts should be valid with respect to the inductive definition
361mentioned above.
362Moreover, if we choose a function matching the deterministic semantics
363then we know that any well-defined C program which goes wrong is
364demonstrating a bug (in the form of missing behaviour) which affects \emph{both} sets of semantics.  Regardless
365of the choice of strategy, we can used failed executions to pinpoint
366rules in the semantics which concern us.  We will discuss the
367implemented strategies and their implications in the following
368section. % TODO: make sure it is the following section later or change to a ref
369
370\label{sec:nonstuck}
371Finally, to match the non-deterministic semantics of \lstinline'estep' precisely we must
372also check that there are no \emph{stuck} subexpressions.  This
373detects subexpressions which cannot be reduced because they are
374erroneous, rather than because they are already values.  Normally the
375lack of any reducible subexpression would indicate an error, but the
376extra check is required because an accompanying non-terminating
377subexpression will mask it.  The CompCert documentation gives
378the example
379\lstinline[language=C]'f() + (10 / x)' where \lstinline'x' is
380initially 0 and \lstinline'f' is non-terminating.  Without the check,
381a program containing this expression would appear to be well-defined
382because \lstinline'f' can always be reduced, whereas we should make
383the program undefined because of the potential division by zero.
384
385Informally, the condition for \lstinline'estep' states that \emph{any subexpression in an
386  evaluation context is either a value, or has a further subexpression
387  that is reducible}.  This form suggests an inefficient executable
388approach: for every context perform a search for the reducible
389subexpression.  Fortunately it can be implemented by a
390relatively direct recursive function because we can reuse reducible
391subexpressions that we have already found in a larger context.  Thus
392we only have to check that an expression can be reduced (using the
393reduction functions discussed above) when all of the subexpressions have
394already been reduced to values.
395
396The deterministic semantics does not enforce this condition because it
397commits to one particular evaluation, regardless of whether other
398evaluations may fail.
399This means that the relationship between the non-deterministic and
400deterministic semantics is not a straightforward inclusion: the
401deterministic semantics accepts some executions where the stuckness
402check fails, but rejects the executions allowed by the
403non-deterministic semantics under a different expression
404reduction order.
405  Thus we make the executable version of the
406check optional so that we may implement either semantics.
407
408\subsection{Equivalence to the non-deterministic semantics}
409
410We show the equivalence of the executable semantics to the original
411relational semantics in two parts, summarising the proofs from the formal Coq development.  First, we show that any successful
412execution of a step corresponds to some derivation in the original
413semantics, regardless of the strategy we choose.
414
415\begin{theorem}[Soundness]
416Given a strategy which picks valid contexts, the execution of any step
417$s_1$ to $s_2$ with trace $t$ implies that there exists a derivation
418of \lstinline'step' $s_1$ $t$ $s_2$.
419\end{theorem}
420
421The proof consists of building similar lemmas for each relation in the
422semantics, using case analysis on the states, statements and
423expressions involved, reduction of the executable hypothesis
424and reasoning on equalities % TODO: woolly
425until the premises of the constructor are realised.  The check for
426\emph{stuck} subexpressions is the most involved task: we must prove a
427more general result differentiating between reducible and fully
428reduced expressions, and show that reducible terms found in
429subexpressions can be lifted.  If no such term exists, we show that
430attempting to reduce the whole term is sufficient.
431
432The second part is to show that any step of the original semantics is
433successfully executed by the executable semantics.
434
435\begin{theorem}[Completeness --- non-deterministic]
436For any derivation of \lstinline'step' $s_1$ $t$ $s_2$ which is not
437about to call an unsupported external function there exists a strategy
438$p$ such that
439\lstinline'exec_step' $p$ $s_1$ = \lstinline'OK' $(t,s_2).$
440\end{theorem}
441
442As in the soundness proof, a lemma is shown for each relation.  The general
443form of the proofs is to perform inversion on the hypothesis for the
444derivation and use rewriting and the prior lemmas to reduce the goal.
445Again, the stuckness check is more difficult:  it follows by induction
446on the expression, and showing that if a subexpression gets stuck then
447the parent expression gets stuck too.
448
449Note that each step might require a different strategy because we have
450restricted our attention to strategies that can be expressed as
451functions on the abstract machine state.  This rules out strategies
452where the execution order depends on (for example) previous states or
453randomness, but these do not occur in the CompCert compiler.
454
455\subsection{Strategies and the deterministic semantics}
456
457Two evaluation strategies were implemented for the executable
458semantics.  The first was a simple leftmost-innermost strategy that
459picks the first non-value subexpression that we are allowed to
460reduce.  It was chosen because it is simple to implement, sufficient
461for testing (any bugs were likely to be independent of evaluation
462order, and our experience with the second strategy supported this),
463and could be used as the basis for the second strategy.
464
465% TODO: have I included a discussion of missing behaviour vs wrong
466% behaviour vs extra behaviour?
467
468However, to provide the greatest benefit from the executable semantics
469we wanted to get executions which precisely match the deterministic
470semantics that the compiler actually implements.  That is, we wish to
471have an interpreter that predicts exactly the behaviour of the
472compiled program.  This is not entirely straightforward because the
473original deterministic semantics are formalised in a slightly
474different way to non-deterministic executions: expressions with no
475side effects are dealt with by a big-step relation, and effectful
476expressions use a small-step relation as before.
477Here `effectful' includes both changes to the state and to control flow such as the conditional operator.
478  This
479prioritisation of the effectful subexpressions is also what causes the
480difference in evaluation order.  For example,
481\lstinline[language=C]'x+(x=1)' when \lstinline'x' $\not = 1$ gives a different
482answer with the second strategy because the assignment is evaluated
483first.
484
485Despite the difference in construction, we can still encode the resulting evaluation order as a
486strategy:
487\begin{enumerate}
488\item find the leftmost-innermost \emph{effectful} subexpression (or
489  take the whole expression if it is effect-free), then
490\item
491  \begin{enumerate}
492  \item pick the leftmost-innermost \emph{effect-free} non-value
493    subexpression of that if one is present, otherwise
494  \item reduce the whole subexpression.
495  \end{enumerate}
496\end{enumerate}
497Intuitively, this works because the effect-free subexpressions can be
498reduced in any order that is allowed without changing the result.
499
500To show formally that this matches the relational semantics we first
501show that the big-steps of effect-free subexpressions can be
502decomposed into a series of small-steps using the relations from the
503non-deterministic semantics.  We can then show that iterating the
504executable semantics will perform the same steps, reaching the same
505end state as the big-step.  Then using some lemmas to show that the
506\lstinline'leftcontext's (the deterministic version of expression contexts) used in the semantics are found by
507our leftmost-innermost effectful subexpression strategy, we can show
508that any step of the deterministic semantics is performed by some
509number of steps of the executable semantics:
510\begin{theorem}[Completeness --- deterministic]
511  For any derivation of \lstinline'step' $s_1$ $t$ $s_2$ in the
512  deterministic semantics which is not about to call an unsupported
513  external function, repeated use of the \lstinline'exec_step'
514  function starting with $s_1$ using the above strategy yields the
515  state $s_2$.  Moreover, the concatenation of the resulting traces is
516  $t$.
517\end{theorem}
518Again, a precise version of the above argument is formalised in Coq.
519
520Together with the connections between the non-deterministic semantics
521and the executable semantics, and the result from CompCert linking the
522two relational semantics we get the relationships depicted in
523Figure~\ref{fig:relationships}.
524\begin{figure}
525\begin{center}
526\includegraphics{semantics.pdf}
527\end{center}
528\caption{Relationships between the different CompCert C semantics}
529\label{fig:relationships}
530\end{figure}
531We can see that the new result completes a loop --- confirming the
532intuition that the deterministic semantics is the same as the
533non-deterministic semantics with a particular choice of evaluation order,
534but without the `stuckness-check'.
535% TODO: describe the diagram's meaning a little more?
536
537We also see directly from the executable semantics that the evaluation
538order is the \emph{only} possible source of non-determinism besides
539I/O, because each step of the semantics can be expressed as a function
540rather than a relation.
541
542\subsection{Informal OCaml code}
543\label{sec:driver}
544
545To produce the actual interpreter we add some OCaml code which uses the
546existing CompCert parser to produce CompCert C from C source code, and then
547iterates the step function extracted from Coq until the program successfully
548completes, or fails with an error.
549
550Ideally we would fix each bug that we encounter when testing this
551interpreter, but this can involve repairing the specifications, the
552compiler, and the proofs; all of which CompCert's developers are
553better placed to deal with.  Nonetheless, we wish to continue testing
554in spite of any bugs found.  To achieve this we detect troublesome
555states in the OCaml main loop and override the normal behaviour of the
556executable semantics without changing the formal development.  To
557assist in the implementation of these workarounds we use a higher order
558function to apply a local change to an expression to any applicable
559subexpression appearing in an evaluation context.
560
561We also take the opportunity to add some support functions that are
562outside the scope of the semantics, but which are informally supported by the
563compiler.  They are implemented by detecting states which call these
564functions and running an OCaml version instead.  We provide
565\lstinline'exit' and \lstinline'abort' (trivial functions to halt
566execution), \lstinline'memcpy' and \lstinline'memcmp' (impossible to
567implement in CompCert C because the memory model does not provide
568byte-by-byte representations of pointers\footnote{Implementation of
569  \lstinline'memcpy' and \lstinline'memcmp' was assisted by the fact
570  that Coq's extraction process exposes the internal representation of
571  the memory model, rather than forcing us to implement
572  them through the model's interface.}) and a limited version of
573\lstinline'printf'.  CompCert's semantics also assume that the entire
574program is present in a single file, so we provide some support for
575merging definitions from multiple files to simulate separate
576compilation.
577
578All of these can be added without requiring any extra proof or
579changing the compiler or the semantics.  They can also be turned off
580to be sure that problems we encounter come from the formalised
581semantics and not the workarounds.
582
583Finally, we note that the execution of the semantics can be examined
584in the debugger packaged with OCaml when absolute certainty about the
585cause of a problem is required.  Indeed, during the testing below the
586debugger was used to retrieve extra information about the program
587state that was not included in the interpreter's error message.
588
589
590\section{Testing}
591\label{sec:testing}
592
593After some basic testing to ensure that the interpreter was
594functioning correctly (in particular, testing the informal driver code),
595we proceeded with the example that illustrated an omission in CerCo's
596Clight semantics, then attempted more demanding tests.
597
598\subsection{Function pointers}
599
600The simple program
601\begin{lstlisting}[language=C]
602int zero(void) { return 0; }
603
604int main(void) {
605  int (*f)(void) = zero;
606  return f();
607}
608\end{lstlisting}
609returns zero by a call to the \lstinline'zero' function via a function
610pointer.  It failed during testing of CerCo's executable semantics due
611to a bad type check, and CompCert C was suspected to suffer from the
612same problem.  Indeed, execution fails at the call with the same type
613error: the variable \lstinline'f''s type is a pointer to a function and the
614semantics only accepts a function type.
615
616There are two noteworthy aspects to this bug.  First, the compiler
617itself features the correct type check and works perfectly.  The error
618only affects the specification of the compiler; this is extra
619behaviour that has not been proved correct. The
620semantics can be easily modified to use the same check, and the only
621other change required is to \emph{remove} some steps from the proof
622scripts.  Second, the bug only became relevant from version 1.7 of
623CompCert --- before which the parser always added an explicit
624dereference.  This illustrates a general issue: the intended semantics
625of CompCert C is unclear because it is difficult to understand what we
626should assume about the output of the complex parser.
627
628\subsection{Csmith}
629
630Csmith is a tool for the random generation of test cases for C
631compilers that uses a mixture of static and dynamic checks to ensure
632that the generated code has well-defined
633behaviour~\cite{Yang:2011:FUB:1993498.1993532}.  Failing test cases
634are identified by the compiler failing with a crash or error, the
635generated code crashing, or by comparing the output of the test case
636with that obtained using different compilers.  Each generated test
637case outputs a checksum to summarise its runtime behaviour for
638comparison.
639
640Csmith is particularly interesting because it has already been used to
641test the CompCert compiler as a whole where it detected several bugs
642in the informal part of the compiler and a mismatch between the
643assembler output and the assembler's behaviour.  These problems have
644since been corrected, partly by the introduction of CompCert C as the
645input language.  Thus it gives us the opportunity to compare the
646semantics against the actual compiler.  However, it is not an ideal
647tool for testing semantics because it focuses on detecting
648`middle-end' bugs, and thus only generates a limited range of
649front-end language features.
650
651Given the previous testing of the compiler, it was unsurprising that
652we found no problems when executing the randomly generated code with
653the interpreter and comparing the results against the
654compiler\footnote{Using Csmith version 2.0.}.  However, we did
655encounter a failure with the \emph{non-random} support code which
656implements the dynamic checks that prevent undefined behaviour in
657arithmetic operations.  The failing code can be seen in
658Figure~\ref{fig:csmithfail}.
659\begin{figure}
660\begin{lstlisting}
661int8_t lshift_func_int8_t_s_s(int8_t left, int right)
662{
663  return
664    ((left < 0) ||
665     (((int)right) < 0) ||
666     (((int)right) >= 32) ||
667     (left > (INT8_MAX >> ((int)right)))) ?
668
669    left :
670
671    (left << ((int)right));
672}
673\end{lstlisting}
674\caption{Excerpt of the safe arithmetic code from Csmith (edited for readability)}
675\label{fig:csmithfail}
676\end{figure}
677
678 The failure is caused by the reduction rules for conditional expressions:
679\begin{lstlisting}[language=coq]
680  | red_condition_true: forall v1 ty1 r1 r2 ty m,
681      is_true v1 ty1 -> typeof r1 = ty ->
682      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
683           (Eparen r1 ty) m
684  | red_condition_false: forall v1 ty1 r1 r2 ty m,
685      is_false v1 ty1 -> typeof r2 = ty ->
686      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
687           (Eparen r2 ty) m
688\end{lstlisting}
689Each rule requires the type of the chosen subexpression to equal the
690type of the entire expression.  However, here one branch is an 8-bit
691integer, and the other a 32-bit integer so one of the rules cannot be
692applied.  The C standard requires that `the usual arithmetic
693conversions' should be applied to coerce the result to a common
694type~\cite[\S6.5.15]{c99}, so the 8-bit integer should be promoted to
695a 32-bit one.
696
697As with the function pointers, this test works with the compiler
698because the two types have a common representation --- all integers
699are handled as 32-bit words in CompCert.  It is only when they are
700loaded and stored to memory that the integer size is taken into
701account.  Nonetheless, this failure provides us with enough
702information to construct an example on which the compiler fails
703because the representations differ:
704\begin{lstlisting}[language=C]
705double f(int x, int a, double b) {
706  return x ? a : b;
707}
708\end{lstlisting}
709Fortunately, the compiler performs some type reconstruction for the RTL
710intermediate language which makes it produce an error rather than
711bad code.  Curiously, this means that the overall correctness theorem
712still holds if the semantics are corrected because the compiler
713is always allowed to fail, even on reasonable programs.  However, the
714intermediate results about the front-end become false because the bug
715is present much earlier in the compiler than the type reconstruction.
716
717Alleviating this bug was essential for proper testing with Csmith, but
718a full solution requires fixing the semantics, compiler and proofs and
719is beyond the scope of this work.  By implementing an alternative
720reduction rule in OCaml as a workaround we were able to continue
721testing without difficulty.
722
723\subsection{gcc-torture}
724
725The GCC compiler contains an extensive test suite for several of its
726supported languages~\cite[\S 6.4]{gccint4.4}.  The gcc-torture suite
727contains an executable subset of the tests in C.  Many of the test
728cases use GCC-specific features, but we are able to reuse a subset
729selected by another executable C semantics project,
730\texttt{kcc}~\cite{Ellison:2012:EFS:2103656.2103719}.  Unlike the
731Csmith generated tests, gcc-torture contains many specialised tests
732for corner-cases of the C standard.  In addition to finding errors,
733these also serve to highlight deliberate limitations in CompCert's
734semantics.
735
736A small test harness ran each test case with the executable
737semantics.  The failing cases were manually classified and in a few
738cases a workaround or fix to the parser was added to prevent known
739bugs hiding further issues.
740
741The tests revealed that the semantics were missing some minor casting
742rules and that zero initialisation of file scope variables was not
743performed when initialising the memory model.  Both were handled
744correctly by the compiler, in the latter case by the informal code
745that produces the assembler output.
746
747Several issues with the OCaml parser appeared: initialisation of an
748array by a shorter string did not add null-padding; arrays were not
749permitted to the left of a \lstinline'->' dereference; incomplete
750array types were not fully supported; and the reported line numbers in
751error messages could be wrong.  Ironically, the latter was caused by
752side-effects and OCaml's own non-deterministic evaluation order.
753
754Deliberate limitations of CompCert were strongly apparent in the
755result too.  Unsupported and partially supported constructs such as
756\lstinline'long long' integer types and bitfields caused many test
757case failures.  A more interesting case is that the comparison of
758pointers is undefined when one points just beyond the end of the
759object (which is explicitly permitted by the C standard), or when a
760function pointer is used.  However, we were unable to find any mention
761of this limitation in the documentation, and without testing would
762never have known about it.
763
764\section{Related work}
765\label{sec:related}
766
767The CompCert developers had already proposed implementing an
768executable semantics for the first version of Clight, but in a
769big-step style with a bound on the depth of the evaluation to ensure
770termination~\cite[\S 5.4]{springerlink:10.1007/s10817-009-9148-3}.
771However, it would be difficult to implement the workarounds described
772in Section~\ref{sec:driver} with a big-step semantics.
773
774Reaction to this work from the CompCert developers has been positive:
775bugs identified before the next release were fixed (in the case of the
776missing casts, independently before they were identified by testing),
777and Leroy has subsequently added a similar small-step interpreter.
778The new interpreter differs from the present work by focusing solely
779on the non-deterministic semantics and computing all of the successor
780states at once.  This greatly simplifies the checking and proof of
781non-stuckness described in Section~\ref{sec:nonstuck}, and provides an
782option to explore the entire space of evaluation orders (which is
783surprisingly tractable).  However, there is no easy way to mimic just
784the deterministic evaluation order, and no tricks in the driver code
785to make testing easier.  Together with the bug fixes, this makes
786directly comparing test results with the current work infeasible, although
787some brief testing with the gcc-torture suite behaved as expected.
788The latest version of CompCert, 1.11, also features a more efficient
789memory model, improving the interpreter's performance.
790
791% NB: if I change the formatting of Clight, change ClightTSO too
792\v{S}ev\v{c}\'{i}k et al.~\cite{Sevcik:2011:RCV:1926385.1926393} have
793created CompCertTSO, a derivative of an earlier version of CompCert
794which supports concurrent shared-memory programs with a \emph{Total
795  Store Ordering} (TSO) memory model.  The source language for the
796formalised part of the compiler is ClightTSO and is given an
797executable semantics in addition to the usual relational semantics,
798which `revealed a number of subtle errors.'  The main issues found in
799the present work did not arise because it was based on a version of
800CompCert that predated their introduction, and the pointer comparisons
801were not a problem because they were specified differently in
802CompCertTSO.
803
804Ellison and Ro{\c s}u~\cite{Ellison:2012:EFS:2103656.2103719} have
805developed a semantics for C based on rewriting logic that aims to be
806as close to the standard as possible, including many parts of the
807language that are not currently supported by CompCert.  An interpreter
808called \texttt{kcc} is derived from the semantics and has been tested
809against the gcc-torture test suite that we reused.  They go further
810and perform coverage analysis to gauge the proportion of the semantics
811exercised by the tests. These semantics
812are not used for compiler verification, but are intended for studying
813(and ultimately verifying) the behaviour of C programs.
814
815The Piton compiler~\cite{springerlink:10.1007/BF00243133} was
816formalised in the Boyer-Moore prover, which is an example of an
817environment where an executable semantics is the natural choice.  The
818Piton language is a very low-level systems language defined for the
819project, so they do not benefit from preexisting test suites.
820However, the work is particularly interesting due to the connected
821hardware formalisation of the target.  A later use of executable semantics
822in ACL2 demonstrated the usefulness of executable semantics in hardware
823verification because running an existing test suite for AMD's RTL language
824against the semantics was crucial for convincing managers that the model
825was relevant~\cite[\S 3]{springerlink:10.1007/3-540-49519-3_22}.
826
827Lochbihler and
828Bulwahn~\cite{springerlink:10.1007/978-3-642-22863-6_17} applied and
829extended Isabelle's code extraction and predicate
830compiler~\cite{springerlink:10.1007/978-3-642-03359-9_11} to animate
831the semantics of a multithreaded Java formalisation in Isabelle/HOL,
832JinjaThreads.  This is a very appealing approach because the predicate
833compiler deals with most of the burden of writing an executable
834version of an inductively defined semantics, although JinjaThreads
835still required a substantial amount of work to deal with some
836difficult definitions.  Their description of testing focuses on
837performance rather than correctness, but executable versions of the
838code have been tested on an ongoing basis since the earlier Jinja
839formalisation that it is based
840on~\cite{Klein:2006:MMJ:1146809.1146811}.
841
842% TODO: Norrish, C0, other exec sem?
843
844\section{Conclusions}
845
846We have shown that executable semantics can be useful enough for
847the necessary task of
848validating semantics to justify retrofitting them to existing
849verified compilers, and in the case of CompCert found a real compiler
850bug in the formalised front-end, alongside numerous minor issues.  It
851also illustrates that testing of the semantics can be more sensitive
852than testing the compiler: our original failing case for the
853conditional expressions bug would (and did) pass compiler testing,
854but using that failure in the semantics we were able to derive a test
855case that also failed in the compiler.
856
857In addition to the testing, the executable semantics were also useful
858for demonstrating the limitations of the semantics, both known and
859unknown.  Moreover, we were able to take the opportunity to prove that
860an intuitive relationship between the deterministic and
861non-deterministic semantics of CompCert C holds.
862
863\bibliographystyle{splncs03}
864\bibliography{ccexec}
865
866\end{document}
Note: See TracBrowser for help on using the repository browser.