# Changeset 2361

Ignore:
Timestamp:
Sep 27, 2012, 4:24:04 PM (9 years ago)
Message:

Just half of a page left to find.

File:
1 edited

### Legend:

Unmodified
 r2360 \begin{abstract} We present a proof of correctness, in Matita, for an optimising assembler for the MCS-51 microcontroller. The efficient expansion of pseudoinstructions---namely jumps---into MCS-51 machine instructions is complex. We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using policies'. This makes the proof of correctness for the assembler significantly more straightforward. We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program. Assembly language programs can manipulate concrete addresses in arbitrary ways. Our proof strategy contains a tracking facility for good addresses' and only programs that use good addresses have their semantics preserved under assembly. We present a proof of correctness in Matita for an optimising assembler for the MCS-51 microcontroller. The efficient expansion of pseudoinstructions, namely jumps, into machine instructions is complex. We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using policies', making the proof of correctness for the assembler more straightforward. %We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program. %Assembly language programs can manipulate concrete addresses in arbitrary ways. Our proof strategy contains a tracking facility for good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program. Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable. In particular, we may experiment with allowing the benign manipulation of addresses. The MCS-51 dates from the early 1980s and is commonly called the 8051/8052. Despite the microprocessor's age, derivatives are still widely manufactured by a number of semiconductor foundries, with the processor being used especially in embedded systems development, where well-tested, cheap, predictable microprocessors find their niche. Derivatives are still widely manufactured by a number of semiconductor foundries, with the processor being used especially in embedded systems. The MCS-51 has a relative paucity of features compared to its more modern brethren, with the lack of any caching or pipelining features meaning that timing of execution is predictable, making the MCS-51 very attractive for CerCo's ends. This problem is known to be computationally hard for most CISC architectures (see~\cite{hyde:branch:2006}). To free CerCo's C compiler from having to consider complications relating to branch displacement, we have chosen to implement an optimising assembler, whose input language the compiler will target. To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target. Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register all feature in our assembly language. We simplify the proof by assuming that all our assembly programs are pre-linked (i.e. we do not formalise a linker---this is left for future work). We further simplify by ignoring linking, assuming that all our assembly programs are pre-linked. Another complication we have addressed is that of the cost model. This is stated in an elegant way in the dependent type of the assembler: the assembly function is total over a program, a policy and the proof that the policy is correct for that program. The final complication in the proof of correctness of our optimising assembler is due to the kind of semantics associated to pseudo-assembly programs. Should assembly programs be allowed to freely manipulate addresses? The answer to the question deeply affects the proof of correctness. The traditional answer is no: values stored in memory or registers are either concrete data or symbolic addresses. The latters can be manipulated only in very restrictive ways and many programs that do not do so, for malign or benign reasons, are not assigned a semantics and cannot be reasoned about. All programs that have a semantics have it preserved by the compiler. Instead we took a different, novel approach: we allow programs to freely manipulate addresses non symbolically, but we only grant preservation of the semantics for those programs that do behave in a correct, anticipated way. At least in principle, this should allow some reasoning on the actual semantics of malign programs. In practice, we note how the alternative approach allows more code reusal between the semantics of assembly code and object code, with benefits on the size of the formalisation. A final complication in the proof is due to the kind of semantics associated to pseudo-assembly programs. Should assembly programs be allowed to freely manipulate addresses? The traditional answer is no': values stored in memory or registers are either concrete data or symbolic addresses. The latter can only be manipulated in very restricted ways and programs that do not do so are not assigned a semantics and cannot be reasoned about. All programs that have a semantics have it preserved by the assembler. We take an alternative approach, allowing programs to freely manipulate addresses non-symbolically but only granting a preservation of semantics to those programs that act in well-behaved' ways. In principle, this should allow some reasoning on the actual semantics of malign programs. In practice, we note how our approach facilitates more code reuse between the semantics of assembly code and object code. The rest of this paper is a detailed description of our proof that is marginally still a work in progress. In Subsection~\ref{subsect.matita} we provide a brief overview of the Matita proof assistant for the unfamiliar reader. In Section~\ref{sect.the.proof} we discuss the design and implementation of the proof proper. In Section~\ref{sect.conclusions} we conclude. % ---------------------------------------------------------------------------- % % SECTION                                                                      % % ---------------------------------------------------------------------------- % \subsection{Matita} \label{subsect.matita} \paragraph{Matita} Matita is a proof assistant based on a variant of the Calculus of (Co)inductive Constructions~\cite{asperti:user:2007}. It features dependent types that we exploit in the formalisation. The aim of the section is to explain the main ideas and steps of the certified proof of correctness for an optimizing assembler for the MCS-51. The formalisation is available at~\url{http://cerco.cs.unibo.it}. proof of correctness for an optimizing assembler for the MCS-51. In Section~\ref{subsect.machine.code.semantics} we sketch an operational semantics (a realistic and efficient emulator) for the MCS-51. In Section~\ref{subsect.assembly.code.semantics} we describe the assembly language and its operational semantics. The latter is parametric in the cost model that will be induced by the assembler. It reuses the semantics of the machine code on all real' (i.e. non-pseudo-) instructions. The latter is parametric in the cost model that will be induced by the assembler, reusing the semantics of the machine code on all real' instructions. Branch displacement policies are introduced in Section~\ref{subsect.the.assembler} where we also describe the assembler as a function over policies as previously described. The proof of correctness of the assembler consists in showing that the object code given in output, together with a cost model for the source program, simulates the source program executed using that cost model. To prove our assembler correct we show that the object code given in output, together with a cost model for the source program, simulates the source program executed using that cost model. The proof can be divided into two main lemmas. The first is correctness with respect to fetching, described in Section~\ref{subsect.total.correctness.of.the.assembler}. It roughly states that one step of fetching at the assembly level that returns the decoded instruction $I$ is simulated by $n$ steps of fetching at the object level that returns instructions $J_1,\ldots,J_n$, where $J_1,\ldots,J_n$ is, amongst the possible expansions of $I$, the one picked by the policy. The second lemma shows that $J_1,\ldots,J_n$ simulates $I$ but only if $I$ is well-behaved, i.e. it manipulates addresses in ways that are anticipated in the correctness proof. To keep track of well-behaved address manipulations, we couple the assembly status with a map that records where addresses are currently stored in memory or in the processor's accumulators. We then introduce a dynamic checking function that inspects the assembly status and this map to determine if the operation is well-behaved. An affirmative answer is the pre-condition of the lemma. Roughly it states that a step of fetching at the assembly level, returning the decoded instruction $I$, is simulated by $n$ steps of fetching at the object level that returns instructions $J_1,\ldots,J_n$, where $J_1,\ldots,J_n$ is, amongst the possible expansions of $I$, the one picked by the policy. The second lemma states that $J_1,\ldots,J_n$ simulates $I$ but only if $I$ is well-behaved, i.e. manipulates addresses in good' ways. To keep track of well-behaved address manipulations we record where addresses are currently stored (in memory or an accumulator). We introduce a dynamic checking function that inspects this map to determine if the operation is well-behaved, with an affirmative answer being the pre-condition of the lemma. The second lemma is detailed in Section~\ref{subsect.total.correctness.for.well.behaved.assembly.programs} where we also establish correctness of our assembler as a composition of the two lemmas: programs that are well-behaved when executed under the cost model induced by the compiler are correctly simulated by the compiled code. For all newly introduced pseudoinstructions, we simply translate labels to concrete addresses before behaving as a real' instruction. In contrast to other approaches, we do not perform any kind of symbolic execution, wherein data is the disjoint union of bytes and addresses, with addresses kept opaque and immutable. We do not perform any kind of symbolic execution, wherein data is the disjoint union of bytes and addresses, with addresses kept opaque and immutable. Labels are immediately translated to concrete addresses, and registers and memory locations only ever contain bytes, never labels. As a consequence, we allow the programmer to mangle, change and generally adjust addresses as they want, under the proviso that the translation process may not be able to preserve the semantics of programs that do this. \end{cases} \end{gather*} In contrast, in this paper we take a different approach. We trace memory locations (and, potentially, registers) that contain memory addresses. We then prove that only those assembly programs that use addresses in safe' ways have their semantics preserved by the assembly process---a sort of dynamic type system sitting atop memory. We believe that this approach is more flexible when compared to the traditional approach, as in principle it allows us to introduce some permitted \emph{benign} manipulations of addresses that the traditional approach, using opaque addresses, cannot handle, therefore expanding the set of input programs that can be assembled correctly. This approach, of using real addresses coupled with a weak, dynamic typing system sitting atop of memory, is similar to one taken by Tuch \emph{et al}~\cite{tuch:types:2007}, for reasoning about low-level C code. In this paper we take a different approach, tracing memory locations (and accumulators) that contain memory addresses. We prove that only those assembly programs that use addresses in safe' ways have their semantics preserved by the assembly process---a sort of dynamic type system sitting atop memory. In principle this approach allows us to introduce some permitted \emph{benign} manipulations of addresses that the traditional approach, using opaque addresses, cannot handle, therefore expanding the set of input programs that can be assembled correctly. This approach seems similar to one taken by Tuch \emph{et al}~\cite{tuch:types:2007} for reasoning about low-level C code. Our analogue of the semantic function above is merely a wrapper around the function that implements the semantics of machine code, paired with a function that keeps track of addresses. An additional function, \texttt{ticks\_of}, is merely a wrapper around this function. Finally, we are able to state and prove our main theorem. This relates the execution of a single assembly instruction and the execution of (possibly) many machine code instructions, as long as we are able to track memory addresses properly: Finally, we are able to state and prove our main theorem, relating the execution of a single assembly instruction and the execution of (possibly) many machine code instructions, as long as we are able to track memory addresses properly: \begin{lstlisting} theorem main_thm: This verification explicitly assumes the existence of, amongst other things, a trustworthy assembler and compiler. Note that both CompCert, CompCertTSO and the seL4 formalisation assume the existence of trustworthy' assemblers. For instance, the CompCert C compiler's backend stops at the PowerPC assembly language, in the default backend. The observation that an optimising assembler cannot preserve the semantics of every assembly program may have important consequences for these projects (though in the case of CompCertTSO, targetting a multiprocessor, what exactly constitutes the subset of good programs' may not be entirely clear). If CompCert chooses to assume the existence of an optimising assembler, then care should be made to ensure that any assembly program produced by the CompCert compiler falls into the subset of programs that have a hope of having their semantics preserved by an optimising assembler. CompCert, CompCertTSO and the seL4 formalisation assume the existence of trustworthy' assemblers. For instance, the CompCert C compiler's default backend stops at the PowerPC assembly language. The observation that an optimising assembler cannot preserve the semantics of every assembly program may have consequences for these projects (though in the case of CompCertTSO, targetting a multiprocessor, what exactly constitutes the subset of good programs' may not be entirely clear). If CompCert chooses to assume the existence of an optimising assembler, then care should be made to ensure that any assembly program produced by the CompCert compiler falls into the subset of programs that can have their semantics preserved by an optimising assembler. Our formalisation exploits dependent types in different ways and for multiple purposes. Not every proof has been carried out in this way: we only used this style to prove that a function satisfies a specification that only involves that function in a significant way. It would not be natural to see the proof that fetch and assembly commute as the specification of one of the two functions. \paragraph{Related work} %\paragraph{Related work} % piton We are not the first to consider the correctness of an assembler for a non-trivial assembly language. They provide a compiler, virtual machine and operational semantics for the programming language and virtual machine, and prove that their compiler is semantics and type preserving. We believe some other verified assemblers exist in the literature. However, what sets our work apart from that above is our attempt to optimise the machine code generated by our assembler. This complicates any formalisation effort, as an attempt at the best possible selection of machine instructions must be made, especially important on a device such as the MCS-51 with a minuscule code memory. Further, care must be taken to ensure that the time properties of an assembly program are not modified by the assembly process lest we affect the semantics of any program employing the MCS-51's I/O facilities. Though other verified assemblers exist in the literature what sets our work apart from that above is our attempt to optimise the generated machine code. This complicates any formalisation effort as an attempt at the best possible selection of machine instructions must be made---especially important on devices with limited code memory. Care must be taken to ensure that the time properties of an assembly program are not modified by assembly lest we affect the semantics of any program employing the MCS-51's I/O facilities. This is only possible by inducing a cost model on the source code from the optimisation strategy and input program. This will be a \emph{leit motif} of CerCo. \paragraph{Resources} %\paragraph{Resources} Our source files are available at~\url{http://cerco.cs.unibo.it}. We assumed several properties of library functions', e.g. modular arithmetic and datastructure manipulation. We axiomatised various small functions needed to complete the main theorems, as well as some routine' proof obligations of the theorems themselves, in focussing on the main meat of the theorems. We believe that the proof strategy is sound and that we will be able to close all axioms, up to minor bugs that should have local fixes that do not affect the global proof strategy. We believe that the proof strategy is sound and that all axioms can be closed, up to minor bugs that should have local fixes that do not affect the global proof strategy. The development, including the definition of the executable semantics of the MCS-51, is spread across 29 files, with around 18,500 lines of Matita source.