# Changeset 2053

Ignore:
Timestamp:
Jun 13, 2012, 11:02:37 AM (7 years ago)
Message:

Introduction changed, with many paragraphs deleted.

File:
1 edited

### Legend:

Unmodified
 r2052 \begin{abstract} We consider the formalisation of an assembler for Intel MCS-51 assembly language in the Matita proof assistant. This formalisation forms a major component of the EU-funded CerCo project, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language. We present a proof of correctness, in the Matita proof assistant, for an optimising assembler for the MCS-51 8-bit microcontroller. This assembler constitutes a major component of the EU's CerCo (Certified Complexity') project. The efficient expansion of pseudoinstructions---particularly jumps---into MCS-51 machine instructions is complex. We employ a strategy, involving the use of policies', that separates the decision making over how jumps should be expanded from the expansion process itself. This makes the proof of correctness for the assembler significantly more straightforward. We prove, under the assumption of the existence of a correct policy, that the assembly process never fails and preserves the semantics of a subset of assembly programs. Correct policies fail to exist only in a limited number of pathological circumstances. Our assembler is complete with respect to the choice of policy. Surprisingly, we observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program. We isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using policies'. This makes the proof of correctness for the assembler significantly more straightforward. We observe that it is impossible for an optimising assembler to preserve the semantics of every assembly program. Assembly language programs can manipulate concrete addresses in arbitrary ways. Our proof strategy contains a notion of good addresses' and only assembly programs that use good addresses have their semantics preserved under assembly. Our strategy offers increased flexibility over the traditional approach of keeping addresses in assembly opaque. In particular, we may experiment with allowing the benign manipulation of addresses. \end{abstract} We consider the formalisation of an assembler for the Intel MCS-51 8-bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}. This formalisation forms a major component of the EU-funded CerCo project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language. This formalisation forms a major component of the EU-funded CerCo (Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language. The MCS-51 dates from the early 1980s and is commonly called the 8051/8052. As a result the processor is widely used, especially in embedded systems development, where well-tested, cheap, predictable microprocessors find their niche. The MCS-51 has a relative paucity of features compared to its more modern brethren. In particular, the MCS-51 does not possess a cache or any instruction pipelining that would make predicting the concrete cost of executing a single instruction an involved process. Instead, each semiconductor foundry that produces an MCS-51 derivative is able to provide accurate timing information in clock cycles for each instruction in their derivative's instruction set. It is important to stress that this timing information, unlike in more sophisticated processors, is not an estimate, it is a definition'. For the MCS-51, if a manufacturer states that a particular opcode takes three clock cycles to execute, then that opcode \emph{always} takes three clock cycles to execute. This predictability of timing information is especially attractive to the CerCo consortium. We are in the process of constructing a certified, concrete complexity compiler for a realistic processor, and not for building and formalising the worst case execution time tools (WCET---see~\cite{bate:wcet:2011}, amongst many others, for an application of WCET technology to microprocessors with more complex designs) that would be necessary to achieve the same result with, for example, a modern ARM or PowerPC microprocessor. As in most things, what one hand giveth, the other taketh away: the MCS-51's paucity of features, though an advantage in many respects, also quickly become a hindrance, and successfully compiling high-level code for this architecture is a cumbrous and involved process. In particular, the MCS-51 features a relatively miniscule series of memory spaces (including read-only code memory, stack and internal/external random access memory) by modern standards. The MCS-51 has a relative paucity of features compared to its more modern brethren, with the lack of any caching or pipelining features means that timing of execution is predictable, making the MCS-51 very attractive for CerCo's ends. Yet, as in most things, what one hand giveth the other taketh away, and the MCS-51's paucity of features---though an advantage in many respects---also quickly become a hindrance. In particular, the MCS-51 features a relatively minuscule series of memory spaces by modern standards. As a result our C compiler, to have any sort of hope of successfully compiling realistic programs for embedded devices, ought to produce tight' machine code. This is not simple and requires the use of optimisations. For example, the MCS-51 features three unconditional jump instructions: \texttt{LJMP} and \texttt{SJMP}---long jump' and short jump' respectively---and an 11-bit oddity of the MCS-51, \texttt{AJMP}. Each of these three instructions expects arguments in different sizes and behaves in markedly different ways: \texttt{SJMP} may only perform a local jump'; \texttt{LJMP} may jump to any address in the MCS-51's memory space and \texttt{AJMP} may jump to any address in the current memory page. Consequently, the size of each opcode is different, and to squeeze as much code as possible into the MCS-51's limited code memory, the smallest possible opcode that will suffice should be selected. The prototype CerCo C compiler does not attempt to select the smallest jump opcode in this manner, as this was thought to unnecessarily complicate the compilation chain, making the eventual translation and formalisation of the compiler into Matita much harder. Instead, the compiler targets a bespoke assembly language, similar to real world' assembly languages, complete with pseudoinstructions including \texttt{Jmp} and \texttt{Call} instructions. Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register also feature. This latter feature will ease any later consideration of separate compilation in the CerCo compiler. An assembler is used to expand pseudoinstructions into MCS-51 machine code. However, this assembly process is not trivial, for numerous reasons. This is a well known problem to assembler writers, often referred to as branch displacement'. Branch displacement is not a simple problem to solve and requires the implementation of an optimising assembler. Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register all feature in our assembly language. We simplify the process by assuming that all assembly programs are pre-linked (i.e. we do not formalise a linker). The assembler expands pseudoinstructions into MCS-51 machine code, but this assembly process is not trivial, for numerous reasons. For example, our conditional jumps to labels behave differently from their machine code counterparts. At the machine code level, all conditional jumps are short', limiting their range. However, at the assembly level, conditional jumps may jump to a label that appears anywhere in the program, significantly liberalising the use of conditional jumps and further simplifying the design of the CerCo compiler. Further, trying to na\"ively relate assembly programs with their machine code counterparts simply does not work. Machine code programs that fetch from constant addresses in code memory or programs that combine the program counter with constant shifts do not make sense at the assembly level, since the position of instructions in code memory will be known only after assembly and optimisation. More generally, memory addresses can only be compared with other memory addresses. However, checking that memory addresses are only compared against each other at the assembly level is in fact undecidable. In short, we come to the shocking\footnote{For us, anyway.} realisation that, with optimisations, the full preservation of the semantics of all assembly programs is impossible. We believe that this revelation is significant for large formalisation projects that assume the existence of a correct assembler. Projects in this class include both the recent CompCert~\cite{compcert:2011,leroy:formal:2009} and seL4 formalisations~\cite{klein:sel4:2009,klein:sel4:2010}. However, at the assembly level, conditional jumps may jump to a label that appears anywhere in the program, significantly liberalising the use of conditional jumps. Yet, the situation is even more complex than having to expand pseudoinstructions correctly. This is undesirable. Second, as mentioned, the CerCo consortium is in the business of constructing a verified compiler for the C programming language. However, unlike CompCert~\cite{compcert:2011,leroy:formal:2009}---which currently represents the state of the art for industrial grade' verified compilers---CerCo considers not just the \emph{extensional correctness} of the compiler, but also its \emph{intensional correctness}. That is, CompCert focusses solely on the preservation of the \emph{meaning} of a program during the compilation process, guaranteeing that the program's meaning does not change as it is gradually transformed into assembly code. However in any realistic compiler (even the CompCert compiler!) there is no guarantee that the program's time properties are preserved during the compilation process; a compiler's optimisations' could, in theory, even conspire to degrade the concrete complexity of certain classes of programs. CerCo aims to expand the current state of the art by producing a compiler where this temporal degradation is guaranteed not to happen. Moreover, CerCo's approach lifts a program's timing information to the source (C language) level. This has the advantage of allowing a programmer to reason about a program's intensional properties directly on the source code that they write, not on the code that the compiler produces. In order to achieve this, CerCo imposes a cost model on programs or, more specifically, on simple blocks of instructions. Second, CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions. This cost model is induced by the compilation process itself, and its non-compositional nature allows us to assign different costs to identical blocks of instructions depending on how they are compiled. In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it. Policies do not exist in only a limited number of circumstances: namely, if a pseudoinstruction attempts to jump to a label that does not exist, or the program is too large to fit in code memory, even after shrinking jumps according to the best policy. The first circumstance is an example of a serious compiler error, as an ill-formed assembly program was generated, and does not (and should not) count as a mark against the completeness of the assembler. We plan to employ dependent types in CerCo in order to restrict the domain of the compiler to those programs that are semantically correct' and should be compiled. In particular, in CerCo we are also interested in the completeness of the compilation process, whereas previous formalisations only focused on correctness. The rest of this paper is a detailed description of our proof that is, in part, still a work in progress. Each machine code instruction $\mathtt{I^i_j}$ is then assembled, using the \texttt{assembly1} function, into a list of bytes. This process is iterated for each pseudoinstruction, before the lists are flattened into a single bit list representation of the original assembly program. By inspecting the above diagram, it would appear that the best way to proceed with a proof that the assembly process does not change the semantics of an assembly program is by proving the same independently for \texttt{expand\_pseudo\_instruction} and for \texttt{assembly1}. %Namely, we first expand any and all pseudoinstructions into lists of machine instructions, and provide a proof that this process does not change our program's semantics. %Finally, we assemble all machine code instructions into machine code---lists of bytes---and prove once more that this process does not have an adverse effect on a program's semantics. %By composition, we then have that the whole assembly process is semantics preserving. This is a tempting approach to the proof, but ultimately the wrong approach. In particular, to expand a pseudoinstruction we need to know the address at which the expanded instructions will be located, for instance to know if a short jump is possible. That address is a function of the \emph{machine code} generated for the pseudoinstructions already expanded. Thus, we must assemble each pseudoinstruction into machine code before moving on, and this must be eventually reflected in the proof too. Therefore we will have lemmas proving correctness for \texttt{assembly1}, and for the composition of \texttt{assembly1} and \texttt{expand\_pseudo\_instruction}, but not for \texttt{expand\_pseudo\_instruction} alone. % ---------------------------------------------------------------------------- % trivial. \bibliography{cpp-2011.bib} \bibliography{cpp-2012-asm.bib} \end{document}\renewcommand{\verb}{\lstinline}