Ignore:
Timestamp:
Sep 27, 2012, 4:24:04 PM (7 years ago)
Author:
mulligan
Message:

Just half of a page left to find.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • Papers/cpp-asm-2012/cpp-2012-asm.tex

    r2360 r2361  
    4646
    4747\begin{abstract}
    48 We present a proof of correctness, in Matita, for an optimising assembler for the MCS-51 microcontroller.
    49 
    50 The efficient expansion of pseudoinstructions---namely jumps---into MCS-51 machine instructions is complex.
    51 We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies'.
    52 This makes the proof of correctness for the assembler significantly more straightforward.
    53 
    54 We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program.
    55 Assembly language programs can manipulate concrete addresses in arbitrary ways.
    56 Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly.
     48We present a proof of correctness in Matita for an optimising assembler for the MCS-51 microcontroller.
     49The efficient expansion of pseudoinstructions, namely jumps, into machine instructions is complex.
     50We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the proof of correctness for the assembler more straightforward.
     51
     52%We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program.
     53%Assembly language programs can manipulate concrete addresses in arbitrary ways.
     54Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
    5755Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
    5856In particular, we may experiment with allowing the benign manipulation of addresses.
     
    7068
    7169The MCS-51 dates from the early 1980s and is commonly called the 8051/8052.
    72 Despite the microprocessor's age, derivatives are still widely manufactured by a number of semiconductor foundries, with the processor being used especially in embedded systems development, where well-tested, cheap, predictable microprocessors find their niche.
     70Derivatives are still widely manufactured by a number of semiconductor foundries, with the processor being used especially in embedded systems.
    7371
    7472The MCS-51 has a relative paucity of features compared to its more modern brethren, with the lack of any caching or pipelining features meaning that timing of execution is predictable, making the MCS-51 very attractive for CerCo's ends.
     
    9290This problem is known to be computationally hard for most CISC architectures (see~\cite{hyde:branch:2006}).
    9391
    94 To free CerCo's C compiler from having to consider complications relating to branch displacement, we have chosen to implement an optimising assembler, whose input language the compiler will target.
     92To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target.
    9593Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register all feature in our assembly language.
    96 We simplify the proof by assuming that all our assembly programs are pre-linked (i.e. we do not formalise a linker---this is left for future work).
     94We further simplify by ignoring linking, assuming that all our assembly programs are pre-linked.
    9795
    9896Another complication we have addressed is that of the cost model.
     
    127125This is stated in an elegant way in the dependent type of the assembler: the assembly function is total over a program, a policy and the proof that the policy is correct for that program.
    128126
    129 The final complication in the proof of correctness of our optimising assembler
    130 is due to the kind of semantics associated to pseudo-assembly programs.
    131 Should assembly programs be allowed to freely manipulate addresses? The
    132 answer to the question deeply affects the proof of correctness.
    133 The traditional answer is no: values stored in memory or registers are either
    134 concrete data or symbolic addresses. The latters can be manipulated only
    135 in very restrictive ways and many programs that do not do so, for malign or
    136 benign reasons, are not assigned a semantics and cannot be reasoned about.
    137 All programs that have a semantics have it preserved by the compiler.
    138 Instead we took a different, novel approach: we allow programs to freely
    139 manipulate
    140 addresses non symbolically, but we only grant preservation of the semantics
    141 for those programs that do behave in a correct, anticipated way. At least
    142 in principle, this should allow some reasoning on the actual semantics of
    143 malign programs. In practice, we note how the alternative approach allows
    144 more code reusal between the semantics of assembly code and object code,
    145 with benefits on the size of the formalisation.
     127A final complication in the proof is due to the kind of semantics associated to pseudo-assembly programs.
     128Should assembly programs be allowed to freely manipulate addresses?
     129The traditional answer is `no': values stored in memory or registers are either
     130concrete data or symbolic addresses. The latter can only be manipulated
     131in very restricted ways and programs that do not do so are not assigned a semantics and cannot be reasoned about.
     132All programs that have a semantics have it preserved by the assembler.
     133We take an alternative approach, allowing programs to freely
     134manipulate addresses non-symbolically but only granting a preservation of semantics
     135to those programs that act in `well-behaved' ways.
     136In principle, this should allow some reasoning on the actual semantics of malign programs.
     137In practice, we note how our approach facilitates more code reuse between the semantics of assembly code and object code.
    146138
    147139The rest of this paper is a detailed description of our proof that is marginally still a work in progress.
    148 In Subsection~\ref{subsect.matita} we provide a brief overview of the Matita proof assistant for the unfamiliar reader.
    149 In Section~\ref{sect.the.proof} we discuss the design and implementation of the proof proper.
    150 In Section~\ref{sect.conclusions} we conclude.
    151 
    152 % ---------------------------------------------------------------------------- %
    153 % SECTION                                                                      %
    154 % ---------------------------------------------------------------------------- %
    155 \subsection{Matita}
    156 \label{subsect.matita}
    157 
     140\paragraph{Matita}
    158141Matita is a proof assistant based on a variant of the Calculus of (Co)inductive Constructions~\cite{asperti:user:2007}.
    159142It features dependent types that we exploit in the formalisation.
     
    176159
    177160The aim of the section is to explain the main ideas and steps of the certified
    178 proof of correctness for an optimizing assembler for the MCS-51. The
    179 formalisation is available at~\url{http://cerco.cs.unibo.it}.
     161proof of correctness for an optimizing assembler for the MCS-51.
    180162
    181163In Section~\ref{subsect.machine.code.semantics} we sketch an operational semantics (a realistic and efficient emulator) for the MCS-51.
     
    183165
    184166In Section~\ref{subsect.assembly.code.semantics} we describe the assembly language and its operational semantics.
    185 The latter is parametric in the cost model that will be induced by the assembler.
    186 It reuses the semantics of the machine code on all `real' (i.e. non-pseudo-) instructions.
     167The latter is parametric in the cost model that will be induced by the assembler, reusing the semantics of the machine code on all `real' instructions.
    187168
    188169Branch displacement policies are introduced in Section~\ref{subsect.the.assembler} where we also describe the assembler as a function over policies as previously described.
    189170
    190 The proof of correctness of the assembler consists in showing that the object code given in output, together with a cost model for the source program, simulates the source program executed using that cost model.
     171To prove our assembler correct we show that the object code given in output, together with a cost model for the source program, simulates the source program executed using that cost model.
    191172The proof can be divided into two main lemmas.
    192173The first is correctness with respect to fetching, described in Section~\ref{subsect.total.correctness.of.the.assembler}.
    193 It roughly states that one step of fetching at the assembly level that returns the decoded instruction $I$ is simulated by $n$ steps of fetching at the object level that returns instructions $J_1,\ldots,J_n$, where $J_1,\ldots,J_n$ is, amongst the possible expansions of $I$, the one picked by the policy.
    194 The second lemma shows that $J_1,\ldots,J_n$ simulates $I$ but only if $I$ is well-behaved, i.e. it manipulates addresses in ways that are anticipated in the correctness proof.
    195 To keep track of well-behaved address manipulations, we couple the assembly status with a map that records where addresses are currently stored in memory or in the processor's accumulators.
    196 We then introduce a dynamic checking function that inspects the assembly status and this map to determine if the operation is well-behaved.
    197 An affirmative answer is the pre-condition of the lemma.
     174Roughly it states that a step of fetching at the assembly level, returning the decoded instruction $I$, is simulated by $n$ steps of fetching at the object level that returns instructions $J_1,\ldots,J_n$, where $J_1,\ldots,J_n$ is, amongst the possible expansions of $I$, the one picked by the policy.
     175The second lemma states that $J_1,\ldots,J_n$ simulates $I$ but only if $I$ is well-behaved, i.e. manipulates addresses in `good' ways.
     176To keep track of well-behaved address manipulations we record where addresses are currently stored (in memory or an accumulator).
     177We introduce a dynamic checking function that inspects this map to determine if the operation is well-behaved, with an affirmative answer being the pre-condition of the lemma.
    198178The second lemma is detailed in Section~\ref{subsect.total.correctness.for.well.behaved.assembly.programs} where we also establish correctness of our assembler as a composition of the two lemmas: programs that are well-behaved when executed under the cost model induced by the compiler are correctly simulated by the compiled code.
    199179
     
    328308For all newly introduced pseudoinstructions, we simply translate labels to concrete addresses before behaving as a `real' instruction.
    329309
    330 In contrast to other approaches, we do not perform any kind of symbolic execution, wherein data is the disjoint union of bytes and addresses, with addresses kept opaque and immutable.
     310We do not perform any kind of symbolic execution, wherein data is the disjoint union of bytes and addresses, with addresses kept opaque and immutable.
    331311Labels are immediately translated to concrete addresses, and registers and memory locations only ever contain bytes, never labels.
    332312As a consequence, we allow the programmer to mangle, change and generally adjust addresses as they want, under the proviso that the translation process may not be able to preserve the semantics of programs that do this.
     
    541521                                                            \end{cases}
    542522\end{gather*}
    543 In contrast, in this paper we take a different approach.
    544 We trace memory locations (and, potentially, registers) that contain memory addresses.
    545 We then prove that only those assembly programs that use addresses in `safe' ways have their semantics preserved by the assembly process---a sort of dynamic type system sitting atop memory.
    546 
    547 We believe that this approach is more flexible when compared to the traditional approach, as in principle it allows us to introduce some permitted \emph{benign} manipulations of addresses that the traditional approach, using opaque addresses, cannot handle, therefore expanding the set of input programs that can be assembled correctly.
    548 This approach, of using real addresses coupled with a weak, dynamic typing system sitting atop of memory, is similar to one taken by Tuch \emph{et al}~\cite{tuch:types:2007}, for reasoning about low-level C code.
     523In this paper we take a different approach, tracing memory locations (and accumulators) that contain memory addresses.
     524We prove that only those assembly programs that use addresses in `safe' ways have their semantics preserved by the assembly process---a sort of dynamic type system sitting atop memory.
     525In principle this approach allows us to introduce some permitted \emph{benign} manipulations of addresses that the traditional approach, using opaque addresses, cannot handle, therefore expanding the set of input programs that can be assembled correctly.
     526This approach seems similar to one taken by Tuch \emph{et al}~\cite{tuch:types:2007} for reasoning about low-level C code.
    549527
    550528Our analogue of the semantic function above is merely a wrapper around the function that implements the semantics of machine code, paired with a function that keeps track of addresses.
     
    609587An additional function, \texttt{ticks\_of}, is merely a wrapper around this function.
    610588
    611 Finally, we are able to state and prove our main theorem.
    612 This relates the execution of a single assembly instruction and the execution of (possibly) many machine code instructions, as long as we are able to track memory addresses properly:
     589Finally, we are able to state and prove our main theorem, relating the execution of a single assembly instruction and the execution of (possibly) many machine code instructions, as long as we are able to track memory addresses properly:
    613590\begin{lstlisting}
    614591theorem main_thm:
     
    656633This verification explicitly assumes the existence of, amongst other things, a trustworthy assembler and compiler.
    657634
    658 Note that both CompCert, CompCertTSO and the seL4 formalisation assume the existence of `trustworthy' assemblers.
    659 For instance, the CompCert C compiler's backend stops at the PowerPC assembly language, in the default backend.
    660 The observation that an optimising assembler cannot preserve the semantics of every assembly program may have important consequences for these projects (though in the case of CompCertTSO, targetting a multiprocessor, what exactly constitutes the subset of `good programs' may not be entirely clear).
    661 If CompCert chooses to assume the existence of an optimising assembler, then care should be made to ensure that any assembly program produced by the CompCert compiler falls into the subset of programs that have a hope of having their semantics preserved by an optimising assembler.
     635CompCert, CompCertTSO and the seL4 formalisation assume the existence of `trustworthy' assemblers.
     636For instance, the CompCert C compiler's default backend stops at the PowerPC assembly language.
     637The observation that an optimising assembler cannot preserve the semantics of every assembly program may have consequences for these projects (though in the case of CompCertTSO, targetting a multiprocessor, what exactly constitutes the subset of `good programs' may not be entirely clear).
     638If CompCert chooses to assume the existence of an optimising assembler, then care should be made to ensure that any assembly program produced by the CompCert compiler falls into the subset of programs that can have their semantics preserved by an optimising assembler.
    662639
    663640Our formalisation exploits dependent types in different ways and for multiple purposes.
     
    674651Not every proof has been carried out in this way: we only used this style to prove that a function satisfies a specification that only involves that function in a significant way.
    675652It would not be natural to see the proof that fetch and assembly commute as the specification of one of the two functions.
    676 \paragraph{Related work}
     653%\paragraph{Related work}
     654
    677655% piton
    678656We are not the first to consider the correctness of an assembler for a non-trivial assembly language.
     
    683661They provide a compiler, virtual machine and operational semantics for the programming language and virtual machine, and prove that their compiler is semantics and type preserving.
    684662
    685 We believe some other verified assemblers exist in the literature.
    686 However, what sets our work apart from that above is our attempt to optimise the machine code generated by our assembler.
    687 This complicates any formalisation effort, as an attempt at the best possible selection of machine instructions must be made, especially important on a device such as the MCS-51 with a minuscule code memory.
    688 Further, care must be taken to ensure that the time properties of an assembly program are not modified by the assembly process lest we affect the semantics of any program employing the MCS-51's I/O facilities.
     663Though other verified assemblers exist in the literature what sets our work apart from that above is our attempt to optimise the generated machine code.
     664This complicates any formalisation effort as an attempt at the best possible selection of machine instructions must be made---especially important on devices with limited code memory.
     665Care must be taken to ensure that the time properties of an assembly program are not modified by assembly lest we affect the semantics of any program employing the MCS-51's I/O facilities.
    689666This is only possible by inducing a cost model on the source code from the optimisation strategy and input program.
    690 This will be a \emph{leit motif} of CerCo.
    691 \paragraph{Resources}
     667%\paragraph{Resources}
     668
    692669Our source files are available at~\url{http://cerco.cs.unibo.it}.
    693670We assumed several properties of `library functions', e.g. modular arithmetic and datastructure manipulation.
    694671We axiomatised various small functions needed to complete the main theorems, as well as some `routine' proof obligations of the theorems themselves, in focussing on the main meat of the theorems.
    695 We believe that the proof strategy is sound and that we will be able to close all axioms, up to minor bugs that should have local fixes that do not affect the global proof strategy.
     672We believe that the proof strategy is sound and that all axioms can be closed, up to minor bugs that should have local fixes that do not affect the global proof strategy.
    696673
    697674The development, including the definition of the executable semantics of the MCS-51, is spread across 29 files, with around 18,500 lines of Matita source.
Note: See TracChangeset for help on using the changeset viewer.