source: Papers/cpp-exec-2012/ccexec.tex @ 2381

Last change on this file since 2381 was 2381, checked in by campbell, 8 years ago

Executable semantics paper as it was in the first submission.

File size: 40.6 KB
7% NB: might be worth removing this if changing class in favour of
8% a built-in definition.
12  {keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
13   morekeywords={[2]if,then,else,forall,Prop,match,with,end,let},
14  }
17  morekeywords={int8_t}
21        keywordstyle=\color{red}\bfseries,
22        keywordstyle=[2]\color{blue},
23        commentstyle=\color{green},
24        stringstyle=\color{blue},
25        showspaces=false,showstringspaces=false,
26        xleftmargin=1em}
30\title{An Executable Semantics for CompCert C\thanks{The project CerCo acknowledges the financial support of the Future and
31Emerging Technologies (FET) programme within the Seventh Framework
32Programme for Research of the European Commission, under FET-Open grant
33number: 243881}}
34\author{Brian Campbell}
35\institute{LFCS, University of Edinburgh,\\\email{}}
38CompCert is a C compiler developed by Leroy et al, the
39  majority of which is formalised and verified in the Coq proof
40  assistant.  The correctness theorem is defined in terms of a
41  semantics for the `CompCert C' language, but how can we gain faith
42  in those semantics?  We explore one approach: building an equivalent
43  executable semantics that we can check test suites of code
44  against.
47Flaws in a compiler are often reflected in the output they produce:
48buggy compilers produce buggy code.  Moreover, bugs can evade testing
49when they only manifest themselves on source code of a particular shape,
50when particular optimisations are used.  This has made compilers an
51appealing target for mechanized verification, from early work such as
52Milner and Weyhrauch's simple compiler in LCF~\cite{MilnerWeyhrauch72}
53to modern work on practical compilers, such as the Verisoft C0
54compiler~\cite{Leinenbach200823} which is used as a key component of a
55larger verified software system.
57% TODO: does the NICTA work do anything like this?
59The CompCert project~\cite{Leroy:2009:FVR:1538788.1538814} has become
60a nexus of activity in recent years, including the project members'
61own efforts to refine the compiler and add certified
63and external projects such as an extensible optimisation
64framework~\cite{Tatlock:2010:BEV:1806596.1806611} and compiling
65concurrent programs~\cite{Sevcik:2011:RCV:1926385.1926393}.  The main
66result of the project is a C compiler which targets the assembly
67languages for a number of popular processor architectures.  Most of
68the compiler is formalised in the Coq proof
69assistant % TODO: consider citation?
70and accompanied by correctness results.
72To state these correctness properties requires giving some semantics to
73the source and target languages.  In CompCert these are given by
74inductively defined small-step relations.  Errors and omissions in
75these semantics can weaken the overall theorems, potentially masking
76bugs in the compiler.  In particular, if no semantics is given to a
77legitimate C program the compiler is free to generate bad
78code\footnote{By design, the compiler is also free to fail with an
79  error for \emph{any} program, even if it is well-defined.}.
81% TODO: clight paper also mentions executable semantics, in some detail
82Several methods could be used to gain faith in the semantics; in the
83conclusions of~\cite{Leroy:2009:FVR:1538788.1538814} Leroy suggests
84manual review, testing of executable versions and proving connections
85to alternative forms of semantics.  Here we investigate testing an
86executable semantics which, by construction, is closely and formally
87related to CompCert's input language.  The testing is facilitated by
88Coq's extraction mechanism, which produces OCaml code corresponding to
89the executable semantics that can be integrated with CompCert's
90existing parser to produce a C interpreter.
92Our motivation for this work comes principally from the CerCo project
93where we use executable semantics throughout~\cite{Amadio2011175}, and in particular have
94an executable version of CompCert's Clight intermediate
95language~\cite{EDI-INF-RR-1412}.  There we found an omission involving function
96pointers during testing.  We wished to attempt to replicate this in
97CompCert C and search for any other evident problems.  An executable
98semantics is far more practical for this task than manual animation of
99the inductive definitions\footnote{This is partly because manual
100  animation requires providing witnesses for the results or careful
101  management of existential metavariables, and partly because we can
102  prove results once for the executable semantics rather than once per
103  program.  Automatic proof search has similar problems.}.
104Morever, CompCert C is a more complex and interesting language than
105Clight, especially due to the presence of side-effects and
106non-determinism in the evaluation of expressions.
108The main contributions of this paper are
110\item demonstrating the successful retrofitting of a small-step
111  executable semantics to an existing verified compiler with
112  equivalence proofs against the original semantics,
113\item showing that testing of the resulting interpreter does lead to
114  the identification of bugs in both the semantics and the compiler,
115\item showing that the executable semantics can illustrate the
116  limitations of the semantics, and fill in a gap in the relationship
117  between the original deterministic and non-deterministic versions of
118  the semantics, and
119\item demonstrate that a mixture of formal and informal code can make
120  testing more effective by working around known bugs with little
121  effort.
124In Section~\ref{sec:compcert} we will give an overview of the relevant
125parts of the CompCert compiler.  In Section~\ref{sec:construction} we
126will discuss the construction of the executable semantics, its
127relationship to the inductively defined semantics from CompCert and
128the additional OCaml code used to support testing.  This is followed by the
129results of testing the semantics in Section~\ref{sec:testing},
130including descriptions of the problems encountered.
131Finally, Section~\ref{sec:related} discusses related work.
133\section{Overview of the CompCert compiler}
136CompCert compiles a simplified but substantial subset of the C
137programming language to one of 3 different architectures (as of
138version 1.8.2, which we refer to throughout unless stated otherwise).
139The main stages of the compiler, from an abstract syntax tree of the
140`CompCert C' language to target specific pseudo-assembly code, are
141written in the Calculus of Inductive Constructions (CIC) --- the
142dependently-typed language that underlies the Coq proof
143assistant~\cite{coq83}.  The formal proofs of correctness refer to
144these definitions.  Coq's extraction facilities~\cite{coqextr02}
145produce OCaml code corresponding to them, which can be
146combined with a C parser and assembler text generation to complete the
147compiler.  The resulting compilation chain is summarised in
153\caption{Start and end of the CompCert compilation chain}
157Note that the unformalised C parser is not a trivial piece of code.
158Originally based on the CIL library for C parsing~\cite{cil02} but
159heavily adapted, it not only produces an abstract syntax tree, but
160also calculates type information and can perform several
161transformations to provide partial support for features that are not
162present in the formalised compiler.  In particular, bitfields,
163structure passing and assignment, and \lstinline'long double' and
164\lstinline'long long' types are (optionally) handled here.  There is a
165further transformation to simplify the annotated C abstract syntax
166tree into the CompCert C language.
168Earlier version of CompCert did not feature CompCert C at all, but the
169less complex Clight language.  Clight lives on as the next
170intermediate language in the compiler.  Hence several publications on
171CompCert refer to this
172language~\cite{springerlink:10.1007/s10817-009-9148-3}.  The major
173difference between the languages is that the syntax and semantics of
174expressions in Clight are much simpler because they are deterministic
175and side-effect free.  Moreover, CompCert's original semantics were in
176big-step form and lacked support for \texttt{goto} statements.
177Version 1.5 added these using a small-step semantics, and 1.8 added the
178CompCert C language with its C-like expressions.  The latter
179effectively moved some of the work done by the OCaml parser into the
180formalised compiler.
181% TODO: mention parser changes in 1.7?
182% TODO: In response to Csmith?
184CompCert C is also interesting because it has two semantics: one with
185non-deterministic behaviour for expressions which models the
186uncertainty about evaluation order in C, and a deterministic semantics
187which resolves this uncertainty.  The former can be regarded as the
188intended input language of the compiler, and the latter as a more
189specific version that the compiler actually implements.  In addition,
190these semantics have slightly different forms: in particular the
191non-deterministic version is always small-step, whereas the
192deterministic form makes a single big-step for side-effect free
193subexpressions.  The two are connected by a result showing that
194programs which are \emph{safe} in the non-deterministic semantics
195admit some execution in the deterministic semantics, where by
196\emph{safe} we mean that the program never gets stuck regardless of
197the order of evaluation used (for a fixed choice of I/O behaviour).
198This result essentially holds because the deterministic semantics is
199choosing one of the possible evaluation orders.
201Note that throughout the definitions of all of these semantics we
202benefit from a specialisation of C to the targets that CompCert
203supports.  Various \emph{implementation-defined} parts of the language
204such as integer representation and sizes are fixed (in fact, the
205formalisation heavily uses the assumption of a 32-bit word size).
206% which is the largest integer type supported, making long long a pain
207Also, some \emph{undefined behaviour} is given a meaning; notably
208mixtures of reads and writes to a variable in a single expression are
209given a non-deterministic behaviour rather than ruled out as per the C
210standard~\cite[\S 6.5.16]{c99}.
212\section{Construction of the executable semantics}
215The executable semantics consists of several parts: existing
216executable definitions from CompCert, functions which closely
217correspond to the original relational semantics, a mechanism for
218resolving non-deterministic behaviour, and finally the proofs that
219steps of the relational semantics are equivalent to steps of the
220executable semantics.  We do not consider whole program executions or
221interaction with the outside world (I/O) here, but only individual
222steps of the semantics.  It is not difficult to include these (we have done
223this for Clight as part of the CerCo project~\cite{EDI-INF-RR-1412}), but we do
224not expect that it would yield any worthwhile insights during testing
225to justify the effort involved.
227The existing definitions that we can reuse include the memory model
228that is used throughout the CompCert development~\cite{compcert-mm08}.
229This model features symbolic pointers with discrete abstract blocks of
230memory, and so is more suitable for our purposes than non-executable
231alternatives such as a concrete model with non-deterministic choice of
232fresh locations for allocations.  The semantics of operations on
233values (for example, \texttt{+} and \texttt{==}) are naturally defined
234as auxiliary functions in the relational semantics, which we reuse.
235Casts and the conversion of values to booleans are exceptions to this;
236they are defined inductively in the relational semantics and we treat
237them in the same way as the rest of the inductive definitions, below.
239Local and global variable environments are necessarily executable in
240CompCert because they are used throughout the actual compiler code in
241addition to the semantics.  We also make use of the compiler's error
242monad to recover the partiality of the relational semantics (that is,
243CIC is a language of total functions, so we must model failure
246The environments are efficiently implemented because of their use in
247the compiler itself.  However, the memory model was not intended to be
248efficient and builds large chains of closures to represent changes to
249memory.  Fortunately, the performance was sufficient for our testing,
250so we did not attempt to optimise it.
252\subsection{Executing a step of the semantics}
254The relational semantics uses a series of inductive definitions to
255describe the steps of the abstract machine: one for reducing
256\emph{lvalues}, expressions which represent a location that can be
257read from, written to, or extracted as a pointer; one for
258\emph{rvalues} expressions that only yield a value, such as addition;
259and one for setting up function calls:
261Inductive lred: expr -> mem -> expr -> mem -> Prop := ...
262Inductive rred: expr -> mem -> expr -> mem -> Prop := ...
263Inductive callred: expr -> fundef -> list val -> type -> Prop := ...
265The \lstinline'estep' relation represents whole state transitions
266corresponding to the reductions on expressions given by the first
267three relations, and the \lstinline'sstep' relation performs steps of
268statements and function entry and exit:
270Inductive estep: state -> trace -> state -> Prop := ...
271Inductive sstep: state -> trace -> state -> Prop := ...
273The union of these two relations gives the overall \lstinline'step' relation.
275The executable semantics uses functions corresponding to each of these
276relations.  The syntax directed nature of the inductive definitions
277makes this task straightforward.  For example, consider the
278\lstinline'rred' rules for conditional expressions:
280  | red_condition_true: forall v1 ty1 r1 r2 ty m,
281      is_true v1 ty1 -> typeof r1 = ty ->
282      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
283           (Eparen r1 ty) m
284  | red_condition_false: forall v1 ty1 r1 r2 ty m,
285      is_false v1 ty1 -> typeof r2 = ty ->
286      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
287           (Eparen r2 ty) m
289These state that a true conditional reduces to its left subexpression
290and a false one to its right, subject to a typing constraint.
292We rearrange the rules to fit in a tree of pattern matching, check
293type equalities where necessary, and introduce error messages where no
294reduction is possible:
296Definition exec_rred (e:expr) (m:mem) : res (expr * mem) :=
297match e with
299| Econdition (Eval v1 ty1) r1 r2 ty =>
300    do b <- exec_bool_val v1 ty1;
301    let r := if b then r1 else r2 in
302    match type_eq (typeof r) ty with
303    | left _ => OK (Eparen r ty, m)
304    | right _ => Error (msg "type mismatch in Econdition")
305    end
307This transforms the value \lstinline[language=coq]'v1' to a boolean,
308selects the appropriate subexpression and checks its type.  The
309`\lstinline[language=coq]'do'' notation is a use of the error monad:
310the conversion to a boolean is not always defined and may return an
311error, which the monad propogates to the caller of
312\lstinline[language=coq]'exec_rred'.  The auxiliary relations defining
313the relationship between values and booleans,
314\lstinline[language=coq]'exec_bool_val', and casting are also defined
315as functions.
317Defining a function for the statement reduction, \lstinline'sstep', is
318similar.  One part of function entry handling deserves particular
319attention.  CompCert has a notion of \emph{external functions}, which
320are a set of primitive CompCert C functions that are handled outside
321of the program to provide system calls, dynamic memory allocation,
322volatile memory accesses and annotations.  Inductive predicates
323describe what changes to the program's state are allowed.  We have
324only implemented the dynamic memory allocation (by constructing a
325function similar to the predicates for \lstinline'malloc' and
326\lstinline'free'), and provide a predicate on states to identify uses
327of the other calls so that we may give a partial completeness result
328in the next section.
330The expression relation \lstinline'estep' is more difficult.  This is
331the point in the semantics at which non-deterministic behaviour
332appears --- any subexpression in a suitable \emph{context} can be
333reduced.  The contexts are defined by an inductive description of
334which subexpressions can be evaluated in the current state (for
335example, only the guard of a conditional expression can be evaluated
336before the conditional expression itself).  We could execute the
337non-deterministic semantics by collecting all of the possible
338reductions in a set, but we are primarily interested in testing the
339semantics on well behaved C programs, and an existing theorem in
340CompCert shows that any program with a \emph{safe} non-deterministic
341behaviour also has a behaviour in the deterministic semantics.
343Hence we are satisfied with following a single path of execution, and
344so parametrise the executable semantics by a \emph{strategy} function
345which determines the evaluation order by picking a particular
346subexpression, and also returns its context.
347These contexts should be valid with respect to the inductive definition
348mentioned above.
349Moreover, if we choose a function matching the deterministic semantics
350then we know that any well-defined C program which goes wrong is
351demonstrating a bug which affects \emph{both} sets of semantics.  Regardless
352of the choice of strategy, we can used failed executions to pinpoint
353rules in the semantics which concern us.  We will discuss the
354implemented strategies and their implications in the following
355section. % TODO: make sure it is the following section later or change to a ref
358Finally, to match the non-deterministic semantics precisely we must
359also check that there are no \emph{stuck} subexpressions.  This
360detects subexpressions which cannot be reduced because they are
361erroneous, rather than because they are already values.  Normally the
362lack of any reducible subexpression would indicate an error, but the
363extra check is required because an accompanying non-terminating
364subexpression will mask it.  The CompCert documentation gives
365the example
366\lstinline[language=C]'f() + (10 / x)' where \lstinline'x' is
367initially 0 and \lstinline'f' is non-terminating.  Without the check,
368a program containing this expression would appear to be well-defined
369because \lstinline'f' can always be reduced, whereas we should make
370the program undefined because of the potential division by zero.
372Informally, this condition states that \emph{any subexpression in an
373  evaluation context is either a value, or has a further subexpression
374  that is reducible}.  This form suggests an inefficient executable
375approach: for every context perform a search for the reducible
376subexpression.  Fortunately it can be implemented by a
377relatively direct recursive function because we can reuse reducible
378subexpressions that we have already found in a larger context.  Thus
379we only have to check that an expression can be reduced (using the
380reduction functions discussed above) when all of the subexpressions have
381already been reduced to values.
383The deterministic semantics does not enforce this condition because it
384commits to one particular evaluation, regardless of whether other
385evaluations may fail.  Thus we make the executable version of the
386check optional so that we may implement either semantics.
388\subsection{Equivalence to the non-deterministic semantics}
390We show the equivalence of the executable semantics to the original
391relational semantics in two parts.  First, we show that any successful
392execution of a step corresponds to some derivation in the original
393semantics, regardless of the strategy we choose.
396Given a strategy which picks valid contexts, the execution of any step
397$s_1$ to $s_2$ with trace $t$ implies that there exists a derivation
398of \lstinline'step' $s_1$ $t$ $s_2$.
401The proof consists of building similar lemmas for each relation in the
402semantics, using case analysis on the states, statements and
403expressions involved, reduction of the executable hypothesis
404and reasoning on equalities % TODO: woolly
405until the premises of the constructor are realised.  The check for
406\emph{stuck} subexpressions is the most involved task: we must prove a
407more general result differentiating between reducible and fully
408reduced expressions, and show that reducible terms found in
409subexpressions can be lifted.  If no such term exists, we show that
410attempting to reduce the whole term is sufficient.
412The second part is to show that any step of the original semantics is
413successfully executed by the executable semantics.
415\begin{theorem}[Completeness --- non-deterministic]
416For any derivation of \lstinline'step' $s_1$ $t$ $s_2$ which is not
417about to call an unsupported external function there exists a strategy
418$p$ such that
419\lstinline'exec_step' $p$ $s_1$ = \lstinline'OK' $(t,s_2).$
422As in the soundness proof, a lemma is shown for each relation.  The general
423form of the proofs is to perform inversion on the hypothesis for the
424derivation and use rewriting and the prior lemmas to reduce the goal.
425Again, the stuckness check is more difficult:  it follows by induction
426on the expression, and showing that if a subexpression gets stuck then
427the parent expression gets stuck too.
429Note that each step might require a different strategy because we have
430restricted our attention to strategies that can be expressed as
431functions on the abstract machine state.  This rules out strategies
432where the execution order depends on (for example) previous states or
433randomness, but these do not occur in the CompCert compiler.
435\subsection{Strategies and the deterministic semantics}
437Two evaluation strategies were implemented for the executable
438semantics.  The first was a simple leftmost-innermost strategy that
439picks the first non-value subexpression that we are allowed to
440reduce.  It was chosen because it is simple to implement, sufficient
441for testing (any bugs were likely to be independent of evaluation
442order, and our experience with the second strategy supported this),
443and could be used as the basis for the second strategy.
445% TODO: have I included a discussion of missing behaviour vs wrong
446% behaviour vs extra behaviour?
448However, to provide the greatest benefit from the executable semantics
449we wanted to get executions which precisely match the deterministic
450semantics that the compiler actually implements.  That is, we wish to
451have an interpreter that predicts exactly the behaviour of the
452compiled program.  This is not entirely straightforward because the
453original deterministic semantics are formalised in a slightly
454different way to non-deterministic executions: expressions with no
455side effects are dealt with by a big-step relation, and effectful
456expressions (where control flow such as the conditional operator is
457considered an effect) use a small-step relation as before.
459Nonetheless, we can encode the resulting evaluation order as a
462\item find the leftmost-innermost \emph{effectful} subexpression (or
463  take the whole expression if it is effect-free), then
465  \begin{enumerate}
466  \item pick the leftmost-innermost \emph{effect-free} non-value
467    subexpression of that if one is present, otherwise
468  \item reduce the whole subexpression.
469  \end{enumerate}
471Intuitively, this works because the effect-free subexpressions can be
472reduced in any order that is allowed without changing the result.
474To show formally that this matches the relational semantics we first
475show that the big-steps of effect-free subexpressions can be
476decomposed into a series of small-steps using the relations from the
477non-deterministic semantics.  We can then show that iterating the
478executable semantics will perform the same steps, reaching the same
479end state as the big-step.  Then using some lemmas to show that the
480\lstinline'leftcontext's from the deterministic semantics are found by
481our leftmost-innermost effectful subexpression strategy, we can show
482that any step of the deterministic semantics is performed by some
483number of steps of the executable semantics:
484\begin{theorem}[Completeness --- deterministic]
485  For any derivation of \lstinline'step' $s_1$ $t$ $s_2$ in the
486  deterministic semantics which is not about to call an unsupported
487  external function, repeated use of the \lstinline'exec_step'
488  function starting with $s_1$ using the above strategy yields the
489  state $s_2$.  Moreover, the concatenation of the resulting traces is
490  $t$.
493Together with the connections between the non-deterministic semantics
494and the executable semantics, and the result from CompCert linking the
495two relational semantics we get the relationships depicted in
501\caption{Relationships between the different CompCert C semantics}
504We can see that the new result completes a loop --- confirming the
505intuition that the deterministic semantics is the same as the
506non-deterministic semantics with a particular choice of evaluation order,
507but without the `stuckness-check'.
508% TODO: describe the diagram's meaning a little more?
510We also see directly from the executable semantics that the evaluation
511order is the \emph{only} possible source of non-determinism besides
512I/O, because each step of the semantics can be expressed as a function
513rather than a relation.
515\subsection{Informal OCaml code}
518To produce the actual interpreter we add some OCaml code which uses the
519existing CompCert parser to produce CompCert C from C source code, and then
520iterates the step function extracted from Coq until the program successfully
521completes, or fails with an error.
523Ideally we would fix each bug that we encounter when testing this
524interpreter, but this can involve repairing the specifications, the
525compiler, and the proofs; all of which CompCert's developers are
526better placed to deal with.  Nonetheless, we wish to continue testing
527in spite of any bugs found.  To achieve this we detect troublesome
528states in the OCaml main loop and override the normal behaviour of the
529executable semantics without changing the formal development.  To
530assist in the implementation of these workarounds we use a higher order
531function to apply a local change to an expression to any applicable
532subexpression appearing in an evaluation context.
534We also take the opportunity to add some support functions that are
535outside the scope of the semantics, but which are informally supported by the
536compiler.  They are implemented by detecting states which call these
537functions and running an OCaml version instead.  We provide
538\lstinline'exit' and \lstinline'abort' (trivial functions to halt
539execution), \lstinline'memcpy' and \lstinline'memcmp' (impossible to
540implement in CompCert C because the memory model does not provide
541byte-by-byte representations of pointers\footnote{Implementation of
542  \lstinline'memcpy' and \lstinline'memcmp' was assisted by the fact
543  that Coq's extraction process exposes the internal representation of
544  the memory model, rather than forcing us to implement
545  them through the model's interface.}) and a limited version of
546\lstinline'printf'.  CompCert's semantics also assume that the entire
547program is present in a single file, so we provide some support for
548merging definitions from multiple files to simulate separate
551All of these can be added without requiring any extra proof or
552changing the compiler or the semantics.  They can also be turned off
553to be sure that problems we encounter come from the formalised
554semantics and not the workarounds.
556Finally, we note that the execution of the semantics can be examined
557in the debugger packaged with OCaml when absolute certainty about the
558cause of a problem is required.
560The full development is available online as a modified version of CompCert\footnote{\url{}}.
565After some basic testing to ensure that the interpreter was
566functioning correctly (in particular, testing the informal driver code),
567we proceeded with the example that illustrated an omission in CerCo's
568Clight semantics, then attempted more demanding tests.
570\subsection{Function pointers}
572The simple program
574int zero(void) { return 0; }
576int main(void) {
577  int (*f)(void) = zero;
578  return f();
581returns zero by a call to the \lstinline'zero' function via a function
582pointer.  It failed during testing of CerCo's executable semantics due
583to a bad type check, and CompCert C was suspected to suffer from the
584same problem.  Indeed, execution fails at the call with the same type
585error: the variable \lstinline'f''s type is a pointer to a function and the
586semantics only accepts a function type.
588There are two noteworthy aspects to this bug.  First, the compiler
589itself features the correct type check and works perfectly.  The
590semantics can be easily modified to use the same check, and the only
591other change required is to \emph{remove} some steps from the proof
592scripts.  Second, the bug only became relevant from version 1.7 of
593CompCert --- before which the parser always added an explicit
594dereference.  This illustrates a general issue: the intended semantics
595of CompCert C is unclear because it is difficult to understand what we
596should assume about the output of the complex parser.
600Csmith is a tool for the random generation of test cases for C
601compilers that uses a mixture of static and dynamic checks to ensure
602that the generated code has well-defined
603behaviour~\cite{Yang:2011:FUB:1993498.1993532}.  Failing test cases
604are identified by the compiler failing with a crash or error, the
605generated code crashing, or by comparing the output of the test case
606with that obtained using different compilers.  Each generated test
607case outputs a checksum to summarise its runtime behaviour for
610Csmith is particularly interesting because it has already been used to
611test the CompCert compiler as a whole where it detected several bugs
612in the informal part of the compiler and a mismatch between the
613assembler output and the assembler's behaviour.  These problems have
614since been corrected, partly by the introduction of CompCert C as the
615input language.  Thus it gives us the opportunity to compare the
616semantics against the actual compiler.  However, it is not an ideal
617tool for testing semantics because it focuses on detecting
618`middle-end' bugs, and thus only generates a limited range of
619front-end language features.
621Given the previous testing of the compiler, it was unsurprising that
622we found no problems when executing the randomly generated code with
623the interpreter and comparing the results against the
624compiler\footnote{Using Csmith version 2.0.}.  However, we did
625encounter a failure with the \emph{non-random} support code which
626implements the dynamic checks that prevent undefined behaviour in
627arithmetic operations.  The failing code can be seen in
631int8_t lshift_func_int8_t_s_s(int8_t left, int right)
633  return
634    ((left < 0) ||
635     (((int)right) < 0) ||
636     (((int)right) >= 32) ||
637     (left > (INT8_MAX >> ((int)right)))) ?
639    left :
641    (left << ((int)right));
644\caption{Excerpt of the safe arithmetic code from Csmith (edited for readability)}
648 The failure is caused by the reduction rules for conditional expressions:
650  | red_condition_true: forall v1 ty1 r1 r2 ty m,
651      is_true v1 ty1 -> typeof r1 = ty ->
652      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
653           (Eparen r1 ty) m
654  | red_condition_false: forall v1 ty1 r1 r2 ty m,
655      is_false v1 ty1 -> typeof r2 = ty ->
656      rred (Econdition (Eval v1 ty1) r1 r2 ty) m
657           (Eparen r2 ty) m
659Each rule requires the type of the chosen subexpression to equal the
660type of the entire expression.  However, here one branch is an 8-bit
661integer, and the other a 32-bit integer so one of the rules cannot be
662applied.  The C standard requires that `the usual arithmetic
663conversions' should be applied to coerce the result to a common
664type~\cite[\S6.5.15]{c99}, so the 8-bit integer should be promoted to
665a 32-bit one.
667As with the function pointers, this test works with the compiler
668because the two types have a common representation --- all integers
669are handled as 32-bit words in CompCert.  It is only when they are
670loaded and stored to memory that the integer size is taken into
671account.  Nonetheless, this failure provides us with enough
672information to construct an example on which the compiler fails
673because the representations differ:
675double f(int x, int a, double b) {
676  return x ? a : b;
679Fortunately, the compiler performs some type reconstruction for the RTL
680intermediate language which makes it produce an error rather than
681bad code.  Curiously, this means that the overall correctness theorem
682still holds if the semantics are corrected because the compiler
683is always allowed to fail, even on reasonable programs.  However, the
684intermediate results about the front-end become false because the bug
685is present much earlier in the compiler than the type reconstruction.
687Alleviating this bug was essential for proper testing with Csmith, but
688a full solution requires fixing the semantics, compiler and proofs and
689is beyond the scope of this work.  By implementing an alternative
690reduction rule in OCaml as a workaround we were able to continue
691testing without difficulty.
695The GCC compiler contains an extensive test suite for several of its
696supported languages~\cite[\S 6.4]{gccint4.4}.  The gcc-torture suite
697contains an executable subset of the tests in C.  Many of the test
698cases use GCC-specific features, but we are able to reuse a subset
699selected by another executable C semantics project,
700\texttt{kcc}~\cite{Ellison:2012:EFS:2103656.2103719}.  Unlike the
701Csmith generated tests, gcc-torture contains many specialised tests
702for corner-cases of the C standard.  In addition to finding errors,
703these also serve to highlight deliberate limitations in CompCert's
706A small test harness ran each test case with the executable
707semantics.  The failing cases were manually classified and in a few
708cases a workaround or fix to the parser was added to prevent known
709bugs hiding further issues.
711The tests revealed that the semantics were missing some minor casting
712rules and that zero initialisation of file scope variables was not
713performed when initialisation the memory model.  Both were handled
714correctly by the compiler, in the latter case by the informal code
715that produces the assembler output.
717Several issues with the OCaml parser appeared: initialisation of an
718array by a shorter string did not add null-padding; arrays were not
719permitted to the left of a \lstinline'->' dereference; incomplete
720array types were not fully supported; and the reported line numbers in
721error messages could be wrong.  Ironically, the latter was caused by
722side-effects and OCaml's own non-deterministic evaluation order.
724Deliberate limitations of CompCert were strongly apparent in the
725result too.  Unsupported and partially supported constructs such as
726\lstinline'long long' integer types and bitfields caused many test
727case failures.  A more interesting case is that the comparison of
728pointers is undefined when one points just beyond the end of the
729object (which is explicitly permitted by the C standard), or when a
730function pointer is used.  However, we were unable to find any mention
731of this limitation in the documentation, and without testing would
732never have known about it.
734\section{Related work}
737The CompCert developers had already proposed implementing an
738executable semantics for the first version of Clight, but in a
739big-step style with a bound on the depth of the evaluation to ensure
740termination~\cite[\S 5.4]{springerlink:10.1007/s10817-009-9148-3}.
741However, it would be difficult to implement the workarounds described
742in Section~\ref{sec:driver} with a big-step semantics.
744Reaction to this work from the CompCert developers has been positive:
745bugs identified before the next release were fixed (in the case of the
746missing casts, independently before they were identified by testing),
747and Leroy has subsequently added a similar small-step interpreter.
748The new interpreter differs from the present work by focusing solely
749on the non-deterministic semantics and computing all of the successor
750states at once.  This greatly simplifies the checking and proof of
751non-stuckness described in Section~\ref{sec:nonstuck}, and provides an
752option to explore the entire space of evaluation orders (which is
753surprisingly tractable).  However, there is no easy way to mimic just
754the deterministic evaluation order, and no tricks in the driver code
755to make testing easier.
757% NB: if I change the formatting of Clight, change ClightTSO too
758\v{S}ev\v{c}\'{i}k et al.~\cite{Sevcik:2011:RCV:1926385.1926393} have
759created CompCertTSO, a derivative of an earlier version of CompCert
760which supports concurrent shared-memory programs with a \emph{Total
761  Store Ordering} (TSO) memory model.  The source language for the
762formalised part of the compiler is ClightTSO and is given an
763executable semantics in addition to the usual relational semantics,
764which `revealed a number of subtle errors.'  The main issues found in
765the present work did not arise because it was based on a version of
766CompCert that predated their introduction, and the pointer comparisons
767were not a problem because they were specified differently in
770Ellison and Ro{\c s}u~\cite{Ellison:2012:EFS:2103656.2103719} have
771developed a semantics for C based on rewriting logic that aims to be
772as close to the standard as possible, including many parts of the
773language that are not currently supported by CompCert.  An interpreter
774called \texttt{kcc} is derived from the semantics and has been tested
775against the gcc-torture test suite that we reused.  These semantics
776are not used for compiler verification, but are intended for studying
777(and ultimately verifying) the behaviour of C programs.
779The Piton compiler~\cite{springerlink:10.1007/BF00243133} was
780formalised in the Boyer-Moore prover, which is an example of an
781environment where an executable semantics is the natural choice.  The
782Piton language is a very low-level systems language defined for the
783project, so they do not benefit from preexisting test suites.
784However, the work is particularly interesting due to the connected
785hardware formalisation of the target.  A later use of executable semantics
786in ACL2 demonstrated the usefulness of executable semantics in hardware
787verification because running an existing test suite for AMD's RTL language
788against the semantics was crucial for convincing managers that the model
789was relevant~\cite[\S 3]{springerlink:10.1007/3-540-49519-3_22}.
791Lochbihler and
792Bulwahn~\cite{springerlink:10.1007/978-3-642-22863-6_17} applied and
793extended Isabelle's code extraction and predicate
794compiler~\cite{springerlink:10.1007/978-3-642-03359-9_11} to animate
795the semantics of a multithreaded Java formalisation in Isabelle/HOL,
796JinjaThreads.  This is a very appealing approach because the predicate
797compiler deals with most of the burden of writing an executable
798version of an inductively defined semantics, although JinjaThreads
799still required a substantial amount of work to deal with some
800difficult definitions.  Their description of testing focuses on
801performance rather than correctness, but executable versions of the
802code have been tested on an ongoing basis since the earlier Jinja
803formalisation that it is based
806% TODO: Norrish, C0, other exec sem?
810We have shown that executable semantics can be useful enough for
811validating semantics to justify retrofitting them to existing
812verified compilers, and in the case of CompCert found a real compiler
813bug in the formalised front-end, alongside numerous minor issues.  It
814also illustrates that testing of the semantics can be more sensitive
815than testing the compiler: our original failing case for the
816conditional expressions bug would (and did) pass compiler testing,
817but using that failure in the semantics we were able to derive a test
818case that also failed in the compiler.
820In addition to the testing, the executable semantics were also useful
821for demonstrating the limitations of the semantics, both known and
822unknown.  Moreover, we were able to take the opportunity to prove that
823an intuitive relationship between the deterministic and
824non-deterministic semantics of CompCert C holds.
Note: See TracBrowser for help on using the repository browser.