Changeset 2053


Ignore:
Timestamp:
Jun 13, 2012, 11:02:37 AM (5 years ago)
Author:
mulligan
Message:

Introduction changed, with many paragraphs deleted.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • src/ASM/CPP2012-asm/cpp-2012-asm.tex

    r2052 r2053  
    4646
    4747\begin{abstract}
    48 We consider the formalisation of an assembler for Intel MCS-51 assembly language in the Matita proof assistant.
    49 This formalisation forms a major component of the EU-funded CerCo project, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
     48We present a proof of correctness, in the Matita proof assistant, for an optimising assembler for the MCS-51 8-bit microcontroller.
     49This assembler constitutes a major component of the EU's CerCo (`Certified Complexity') project.
    5050
    5151The efficient expansion of pseudoinstructions---particularly jumps---into MCS-51 machine instructions is complex.
    52 We employ a strategy, involving the use of `policies', that separates the decision making over how jumps should be expanded from the expansion process itself.
    53 This makes the proof of correctness for the assembler significantly more straightforward.
    54 
    55 We prove, under the assumption of the existence of a correct policy, that the assembly process never fails and preserves the semantics of a subset of assembly programs.
    56 Correct policies fail to exist only in a limited number of pathological circumstances.
    57 Our assembler is complete with respect to the choice of policy.
    58 
    59 Surprisingly, we observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program.
     52We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies'.
     53This makes the proof of correctness for the assembler significantly more straightforward. 
     54
     55We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program.
     56Assembly language programs can manipulate concrete addresses in arbitrary ways.
     57Our proof strategy contains a notion of `good addresses' and only assembly programs that use good addresses have their semantics preserved under assembly.
     58Our strategy offers increased flexibility over the traditional approach of keeping addresses in assembly opaque.
     59In particular, we may experiment with allowing the benign manipulation of addresses.
    6060\end{abstract}
    6161
     
    6767
    6868We consider the formalisation of an assembler for the Intel MCS-51 8-bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}.
    69 This formalisation forms a major component of the EU-funded CerCo project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
     69This formalisation forms a major component of the EU-funded CerCo (`Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
    7070
    7171The MCS-51 dates from the early 1980s and is commonly called the 8051/8052.
     
    7373As a result the processor is widely used, especially in embedded systems development, where well-tested, cheap, predictable microprocessors find their niche.
    7474
    75 The MCS-51 has a relative paucity of features compared to its more modern brethren.
    76 In particular, the MCS-51 does not possess a cache or any instruction pipelining that would make predicting the concrete cost of executing a single instruction an involved process.
    77 Instead, each semiconductor foundry that produces an MCS-51 derivative is able to provide accurate timing information in clock cycles for each instruction in their derivative's instruction set.
    78 It is important to stress that this timing information, unlike in more sophisticated processors, is not an estimate, it is a `definition'.
    79 For the MCS-51, if a manufacturer states that a particular opcode takes three clock cycles to execute, then that opcode \emph{always} takes three clock cycles to execute.
    80 
    81 This predictability of timing information is especially attractive to the CerCo consortium.
    82 We are in the process of constructing a certified, concrete complexity compiler for a realistic processor, and not for building and formalising the worst case execution time tools (WCET---see~\cite{bate:wcet:2011}, amongst many others, for an application of WCET technology to microprocessors with more complex designs) that would be necessary to achieve the same result with, for example, a modern ARM or PowerPC microprocessor.
    83 
    84 As in most things, what one hand giveth, the other taketh away: the MCS-51's paucity of features, though an advantage in many respects, also quickly become a hindrance, and successfully compiling high-level code for this architecture is a cumbrous and involved process.
    85 In particular, the MCS-51 features a relatively miniscule series of memory spaces (including read-only code memory, stack and internal/external random access memory) by modern standards.
     75The MCS-51 has a relative paucity of features compared to its more modern brethren, with the lack of any caching or pipelining features means that timing of execution is predictable, making the MCS-51 very attractive for CerCo's ends.
     76Yet, as in most things, what one hand giveth the other taketh away, and the MCS-51's paucity of features---though an advantage in many respects---also quickly become a hindrance.
     77In particular, the MCS-51 features a relatively minuscule series of memory spaces by modern standards.
    8678As a result our C compiler, to have any sort of hope of successfully compiling realistic programs for embedded devices, ought to produce `tight' machine code.
    87 This is not simple and requires the use of optimisations.
    8879
    8980For example, the MCS-51 features three unconditional jump instructions: \texttt{LJMP} and \texttt{SJMP}---`long jump' and `short jump' respectively---and an 11-bit oddity of the MCS-51, \texttt{AJMP}.
    9081Each of these three instructions expects arguments in different sizes and behaves in markedly different ways: \texttt{SJMP} may only perform a `local jump'; \texttt{LJMP} may jump to any address in the MCS-51's memory space and \texttt{AJMP} may jump to any address in the current memory page.
    9182Consequently, the size of each opcode is different, and to squeeze as much code as possible into the MCS-51's limited code memory, the smallest possible opcode that will suffice should be selected.
    92 
    93 The prototype CerCo C compiler does not attempt to select the smallest jump opcode in this manner, as this was thought to unnecessarily complicate the compilation chain, making the eventual translation and formalisation of the compiler into Matita much harder.
    94 Instead, the compiler targets a bespoke assembly language, similar to `real world' assembly languages, complete with pseudoinstructions including \texttt{Jmp} and \texttt{Call} instructions.
    95 Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register also feature.
    96 This latter feature will ease any later consideration of separate compilation in the CerCo compiler.
    97 An assembler is used to expand pseudoinstructions into MCS-51 machine code.
    98 
    99 However, this assembly process is not trivial, for numerous reasons.
     83This is a well known problem to assembler writers, often referred to as `branch displacement'.
     84
     85Branch displacement is not a simple problem to solve and requires the implementation of an optimising assembler.
     86Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register all feature in our assembly language.
     87We simplify the process by assuming that all assembly programs are pre-linked (i.e. we do not formalise a linker).
     88The assembler expands pseudoinstructions into MCS-51 machine code, but this assembly process is not trivial, for numerous reasons.
    10089For example, our conditional jumps to labels behave differently from their machine code counterparts.
    10190At the machine code level, all conditional jumps are `short', limiting their range.
    102 However, at the assembly level, conditional jumps may jump to a label that appears anywhere in the program, significantly liberalising the use of conditional jumps and further simplifying the design of the CerCo compiler.
    103 
    104 Further, trying to na\"ively relate assembly programs with their machine code counterparts simply does not work.
    105 Machine code programs that fetch from constant addresses in code memory or programs that combine the program counter with constant shifts do not make sense at the assembly level, since the position of instructions in code memory will be known only after assembly and optimisation.
    106 More generally, memory addresses can only be compared with other memory addresses.
    107 However, checking that memory addresses are only compared against each other at the assembly level is in fact undecidable.
    108 In short, we come to the shocking\footnote{For us, anyway.} realisation that, with optimisations, the full preservation of the semantics of all assembly programs is impossible.
    109 We believe that this revelation is significant for large formalisation projects that assume the existence of a correct assembler.
    110 Projects in this class include both the recent CompCert~\cite{compcert:2011,leroy:formal:2009} and seL4 formalisations~\cite{klein:sel4:2009,klein:sel4:2010}.
     91However, at the assembly level, conditional jumps may jump to a label that appears anywhere in the program, significantly liberalising the use of conditional jumps.
    11192
    11293Yet, the situation is even more complex than having to expand pseudoinstructions correctly.
     
    11798This is undesirable.
    11899
    119 Second, as mentioned, the CerCo consortium is in the business of constructing a verified compiler for the C programming language.
    120 However, unlike CompCert~\cite{compcert:2011,leroy:formal:2009}---which currently represents the state of the art for `industrial grade' verified compilers---CerCo considers not just the \emph{extensional correctness} of the compiler, but also its \emph{intensional correctness}.
    121 That is, CompCert focusses solely on the preservation of the \emph{meaning} of a program during the compilation process, guaranteeing that the program's meaning does not change as it is gradually transformed into assembly code.
    122 However in any realistic compiler (even the CompCert compiler!) there is no guarantee that the program's time properties are preserved during the compilation process; a compiler's `optimisations' could, in theory, even conspire to degrade the concrete complexity of certain classes of programs.
    123 CerCo aims to expand the current state of the art by producing a compiler where this temporal degradation is guaranteed not to happen.
    124 Moreover, CerCo's approach lifts a program's timing information to the source (C language) level.
    125 This has the advantage of allowing a programmer to reason about a program's intensional properties directly on the source code that they write, not on the code that the compiler produces.
    126 
    127 In order to achieve this, CerCo imposes a cost model on programs or, more specifically, on simple blocks of instructions.
     100Second, CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions.
    128101This cost model is induced by the compilation process itself, and its non-compositional nature allows us to assign different costs to identical blocks of instructions depending on how they are compiled.
    129102In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it.
     
    191164Policies do not exist in only a limited number of circumstances: namely, if a pseudoinstruction attempts to jump to a label that does not exist, or the program is too large to fit in code memory, even after shrinking jumps according to the best policy.
    192165The first circumstance is an example of a serious compiler error, as an ill-formed assembly program was generated, and does not (and should not) count as a mark against the completeness of the assembler.
    193 We plan to employ dependent types in CerCo in order to restrict the domain of the compiler to those programs that are `semantically correct' and should be compiled.
    194 In particular, in CerCo we are also interested in the completeness of the compilation process, whereas previous formalisations only focused on correctness.
    195166
    196167The rest of this paper is a detailed description of our proof that is, in part, still a work in progress.
     
    277248Each machine code instruction $\mathtt{I^i_j}$ is then assembled, using the \texttt{assembly1} function, into a list of bytes.
    278249This process is iterated for each pseudoinstruction, before the lists are flattened into a single bit list representation of the original assembly program.
    279 
    280 By inspecting the above diagram, it would appear that the best way to proceed with a proof that the assembly process does not change the semantics of an assembly program is by proving the same independently for
    281 \texttt{expand\_pseudo\_instruction} and for \texttt{assembly1}.
    282 %Namely, we first expand any and all pseudoinstructions into lists of machine instructions, and provide a proof that this process does not change our program's semantics.
    283 %Finally, we assemble all machine code instructions into machine code---lists of bytes---and prove once more that this process does not have an adverse effect on a program's semantics.
    284 %By composition, we then have that the whole assembly process is semantics preserving.
    285 This is a tempting approach to the proof, but ultimately the wrong approach.
    286 In particular, to expand a pseudoinstruction we need to know the address at which the expanded instructions will be located, for instance to know if a short jump is possible.
    287 That address is a function of the \emph{machine code} generated for the pseudoinstructions already expanded.
    288 Thus, we must assemble each pseudoinstruction into machine code before moving on, and this must be eventually reflected in the proof too.
    289 Therefore we will have lemmas proving correctness for \texttt{assembly1}, and for the composition of \texttt{assembly1} and
    290 \texttt{expand\_pseudo\_instruction}, but not for \texttt{expand\_pseudo\_instruction} alone.
    291250
    292251% ---------------------------------------------------------------------------- %
     
    659618trivial.
    660619
    661 \bibliography{cpp-2011.bib}
     620\bibliography{cpp-2012-asm.bib}
    662621
    663622\end{document}\renewcommand{\verb}{\lstinline}
Note: See TracChangeset for help on using the changeset viewer.