source: Papers/sttt/main.tex @ 3512

Last change on this file since 3512 was 3512, checked in by mulligan, 5 years ago

more work on the introduction

  • Property svn:executable set to *
File size: 52.7 KB
22\title{On the proof of correctness of a verified optimising assembler
23\thanks{Research supported by the CerCo project, within the Future and Emerging Technologies (FET) programme of the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 243881}}
24\author{Jaap Boender \and Dominic P. Mulligan \and Claudio Sacerdoti Coen}
25%\institute{Department of Computer Science, University of Middlesex \and Computer Laboratory, University of Cambridge \and Dipartimento di Informatica, University of Bologna}
30Optimising assemblers face the `branch displacement' or `jump encoding' problem, that is: how best to choose between concrete machine code jump instructions---typically of differing instruction and offset sizes---when expanding pseudo-instructions.
31Ideally, on space constrained hardware, an optimising assembler would choose the set of pseudo-instruction expansions that minimises the size of the resulting machine code program, a task that is provably \textsc{np}-hard.
33As part of CerCo (`Certified Complexity')---an \textsc{eu}-funded project to develop a verified concrete complexity preserving compiler for a large subset of the C programming language---we have implemented and proved correct an optimising assembler within the Matita interactive theorem prover.
34Our assembler targets the instruction set of a typical micro-controller, the Intel \textsc{mcs-51} series.
35As is common with embedded systems development this micro-controller has a paucity of available code memory and therefore we face an additional pressure in reducing the size of any assembled machine code program.
36Out of necessity, then, our assembler implements an algorithm for solving the branch displacement problem, and we have proved that this algorithm is correct.
38However, this efficient expansion of jump pseudoinstructions into machine code equivalents is complex, and therefore could unneccessarily complicate the proof of correctness for the rest of the assembler.
39We therefore isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the rest of the proof of correctness more straightforward.
40Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
41Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
42In particular, our approach permits experimentation with the benign manipulation of addresses.
44We discuss wider issues associated with a proof of correctness for an assembler, detail our algorithm solving the `branch displacement' problem, and discuss our proof of correctness, employing `policies', for the assembler.
46%\keywords{Formal verification, interactive theorem proving, assembler, branch displacement optimisation, CerCo (`Certified Complexity'), MCS-51 microcontroller, Matita proof assistant}
51We consider the formalisation of an assembler for the Intel MCS-51 8-bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}.
52This formalisation forms a major component of the EU-funded CerCo (`Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
53Though this document describes the formalisation of an assembler for a specific microprocessor, we believe that the techniques described are general, and should be of interest to anybody providing a proof of correctness for an optimising assembler.
55The \textsc{mcs-51}, commonly called the 8051, is an 8-bit Harvard architecture \textsc{cisc} instruction set microprocessor dating from the early 1980s.
56An extended variant, the \textsc{mcs-52}, or 8052, was subsequently released adding extra \textsc{ram} and \textsc{rom} above and beyond that offered by the original \textsc{mcs-51}, along with an additional timer.
57Originally manufactured and marketed by Intel, the microprocessor quickly gained and maintained its popularity, especially in embedded systems development, with dozens of semiconductor foundries subsequently releasing derivative copies, many of which are still manufactured today.
59Admittedly, a typical \textsc{mcs-51} derivative lacks many features compared to their more modern brethren.
60In particular, most derivatives lack a cache, any pipelining features, branch speculation, out-of-order execution, or other advanced features we now take for granted on state-of-the-art microprocessors.
61However, as a direct result of this simplicity, program execution is predictable: as part of the processor's specification provided by the manufacturer we can be sure that a \texttt{NOP} instruction will always take exactly one processor cycle to execute, whilst an \texttt{ADD} instruction always takes two processor cycles, and so on and so forth for all instructions in the processor's instruction set.
63Certainty surrounding execution times makes the \textsc{mcs-51} especially attractive for certain embedded development tasks.
64Moreover, this same certainty also makes the \textsc{mcs-51} especially attractive for CerCo's ends, wherein we aim to produce a verified concrete-complexity preserving compiler that lifts a cost model for a C source file---induced on the program's machine code image under compilation---back through the compilation chain.
65This cost model will ultimately be made manifest as complexity assertions annotating the original C source code, and these annotations may then be used by the programmer, perhaps in conjunction with external reasoning tools, to derive accurate resource bounds for their program.
66As a result, knowing that a given block of machine code instructions will take a fixed number of cycles to execute, irrespective of the program context or processor state when executed, allows us to generate a cycle-accurate cost model on the original C source code.
69To do this, we must solve the `branch displacement' problem---deciding how best to expand pseudojumps to labels in assembly language to machine code jumps.
70This problem, also known as `jump encoding', is
71a well-known problem amongst assembler implementors~\cite{Hyde2006} and arises when pseudojumps can be expanded
72in different ways to real machine instructions, but the different expansions
73are not equivalent (e.g. jumps that differ in the distance that they `span', or instructions that differ in the number of clock cycles needed to completely execute them) and are not always
74correct (e.g. correctness is only up to global constraints over the compiled
75code). For instance, some jump instructions (short jumps) are very small
76and can be executed quickly, but they can only reach destinations within a
77certain distance from the current instruction. When the destinations are
78too far away, larger and slower long jumps must be used. The use of a long jump may
79augment the distance between another pseudojump and its target, forcing
80another long jump use, in a cascade. The job of the optimising
81compiler (assembler) is to individually expand every pseudo-instruction in such a way
82that all global constraints are satisfied and that the compiled program
83is minimal in size and faster in concrete time complexity.
84This problem is known to be computationally hard for most CISC architectures (\textsc{np}-hard, see~\cite{hyde:branch:2006}).
86To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target.
87Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS-51's one 16-bit register all feature in our assembly language.
88We further simplify by ignoring linking, assuming that all our assembly programs are pre-linked.
90The requirements of the CerCo project add yet more complications to our proof of correctness, as we must also address a cost model.
91CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions.
92This cost model is induced by the compilation process itself, and its non-compositional nature allows us to assign different costs to identical C statements depending on how they are compiled.
93In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it.
94At the assembler level, this is reflected by our need to induce a cost
95model on the assembly code as a function of the assembly program and the
96strategy used to solve the branch displacement problem. In particular, our
97optimising assembler should also return a map that assigns a cost (in clock
98cycles) to every instruction in the source program. We expect the induced cost
99to be preserved by the assembler: we will prove that the compiled code
100tightly simulates the source code by taking exactly the predicted amount of
102Note that the temporal `tightness' of the simulation is a fundamental prerequisite
103of the correctness of the simulation, as some functions of the MCS-51---timers and \textsc{i/o}---depend on the microprocessor's clock.
104If the pseudo- and concrete clock differ the result of an \textsc{i/o} operation may not be preserved.
126% This problem is the subject of the present paper. After introducing the problem
127% in more detail, we will discuss the solutions used by other compilers, present
128% the algorithm we use in the CerCo assembler, and discuss its verification,
129% that is the proofs of termination and correctness using the Matita proof
130% assistant~\cite{Asperti2007}.
132% Formulating the final statement of correctness and finding the loop invariants
133% have been non-trivial tasks and are, indeed, the main contribution of this
134% paper. It has required considerable care and fine-tuning to formulate not
135% only the minimal statement required for the ulterior proof of correctness of
136% the assembler, but also the minimal set of invariants needed for the proof
137% of correctness of the algorithm.
139% The research presented in this paper has been executed within the CerCo project
140% which aims at formally verifying a C compiler with cost annotations. The
141% target architecture for this project is the MCS-51, whose instruction set
142% contains span-dependent instructions. Furthermore, its maximum addressable
143% memory size is very small (64 Kb), which makes it important to generate
144% programs that are as small as possible. With this optimisation, however, comes increased complexity and hence increased possibility for error. We must make sure that the branch instructions are encoded correctly, otherwise the assembled program will behave unpredictably.
146% All Matita files related to this development can be found on the CerCo
147% website, \url{}. The specific part that contains the
148% branch displacement algorithm is in the {\tt ASM} subdirectory, in the files
149% {\tt}, {\tt} and {\tt}.
151\section{The branch displacement optimisation problem}
153In most modern instruction sets that have them, the only span-dependent
154instructions are branch instructions. Taking the ubiquitous x86-64 instruction
155set as an example, we find that it contains eleven different forms of the
156unconditional branch instruction, all with different ranges, instruction sizes
157and semantics (only six are valid in 64-bit mode, for example). Some examples
158are shown in Figure~\ref{f:x86jumps} (see also~\cite{IntelDev}).
164Instruction & Size (bytes) & Displacement range \\
166Short jump & 2 & -128 to 127 bytes \\
167Relative near jump & 5 & $-2^{32}$ to $2^{32}-1$ bytes \\
168Absolute near jump & 6 & one segment (64-bit address) \\
169Far jump & 8 & entire memory (indirect jump) \\
173\caption{List of x86 branch instructions}
177The chosen target architecture of the CerCo project is the Intel MCS-51, which
178features three types of branch instructions (or jump instructions; the two terms
179are used interchangeably), as shown in Figure~\ref{f:mcs51jumps}.
185Instruction & Size    & Execution time & Displacement range \\
186            & (bytes) & (cycles) & \\
188SJMP (`short jump') & 2 & 2 & -128 to 127 bytes \\
189AJMP (`absolute jump') & 2 & 2 & one segment (11-bit address) \\
190LJMP (`long jump') & 3 & 3 & entire memory \\
194\caption{List of MCS-51 branch instructions}
198Conditional branch instructions are only available in short form, which
199means that a conditional branch outside the short address range has to be
200encoded using three branch instructions (for instructions whose logical
201negation is available, it can be done with two branch instructions, but for
202some instructions this is not the case). The call instruction is
203only available in absolute and long forms.
205Note that even though the MCS-51 architecture is much less advanced and much
206simpler than the x86-64 architecture, the basic types of branch instruction
207remain the same: a short jump with a limited range, an intra-segment jump and a
208jump that can reach the entire available memory.
210Generally, in code fed to the assembler as input, the only
211difference between branch instructions is semantics, not span. This
212means that a distinction is made between an unconditional branch and the
213several kinds of conditional branch, but not between their short, absolute or
214long variants.
216The algorithm used by the assembler to encode these branch instructions into
217the different machine instructions is known as the {\em branch displacement
218algorithm}. The optimisation problem consists of finding as small an encoding as
219possible, thus minimising program length and execution time.
221Similar problems, e.g. the branch displacement optimisation problem for other
222architectures, are known to be NP-complete~\cite{Robertson1979,Szymanski1978},
223which could make finding an optimal solution very time-consuming.
225The canonical solution, as shown by Szymanski~\cite{Szymanski1978} or more
226recently by Dickson~\cite{Dickson2008} for the x86 instruction set, is to use a
227fixed point algorithm that starts with the shortest possible encoding (all
228branch instruction encoded as short jumps, which is likely not a correct
229solution) and then iterates over the source to re-encode those branch
230instructions whose target is outside their range.
232\subsection*{Adding absolute jumps}
234In both papers mentioned above, the encoding of a jump is only dependent on the
235distance between the jump and its target: below a certain value a short jump
236can be used; above this value the jump must be encoded as a long jump.
238Here, termination of the smallest fixed point algorithm is easy to prove. All
239branch instructions start out encoded as short jumps, which means that the
240distance between any branch instruction and its target is as short as possible
241(all the intervening jumps are short).
242If, in this situation, there is a branch instruction $b$ whose span is not
243within the range for a short jump, we can be sure that we can never reach a
244situation where the span of $j$ is so small that it can be encoded as a short
245jump. This argument continues to hold throughout the subsequent iterations of
246the algorithm: short jumps can change into long jumps, but not \emph{vice versa},
247as spans only increase. Hence, the algorithm either terminates early when a fixed
248point is reached or when all short jumps have been changed into long jumps.
250Also, we can be certain that we have reached an optimal solution: a short jump
251is only changed into a long jump if it is absolutely necessary.
253However, neither of these claims (termination nor optimality) hold when we add
254the absolute jump. With absolute jumps, the encoding of a branch
255instruction no longer depends only on the distance between the branch
256instruction and its target. An absolute jump is possible when instruction and
257target are in the same segment (for the MCS-51, this means that the first 5
258bytes of their addresses have to be equal). It is therefore entirely possible
259for two branch instructions with the same span to be encoded in different ways
260(absolute if the branch instruction and its target are in the same segment,
261long if this is not the case).
267    jmp X
268    \ldots
269L\(\sb{0}\): \ldots
270% Start of new segment if
271% jmp X is encoded as short
272    \ldots
273    jmp L\(\sb{0}\)
275\caption{Example of a program where a long jump becomes absolute}
282L\(\sb{0}\): jmp X
283X:  \ldots
284    \ldots
285L\(\sb{1}\): \ldots
286% Start of new segment if
287% jmp X is encoded as short
288    \ldots
289    jmp L\(\sb{1}\)
290    \ldots
291    jmp L\(\sb{1}\)
292    \ldots
293    jmp L\(\sb{1}\) 
294    \ldots
296\caption{Example of a program where the fixed-point algorithm is not optimal}
301This invalidates our earlier termination argument: a branch instruction, once encoded
302as a long jump, can be re-encoded during a later iteration as an absolute jump.
303Consider the program shown in Figure~\ref{f:term_example}. At the start of the
304first iteration, both the branch to {\tt X} and the branch to $\mathtt{L}_{0}$
305are encoded as small jumps. Let us assume that in this case, the placement of
306$\mathtt{L}_{0}$ and the branch to it are such that $\mathtt{L}_{0}$ is just
307outside the segment that contains this branch. Let us also assume that the
308distance between $\mathtt{L}_{0}$ and the branch to it is too large for the
309branch instruction to be encoded as a short jump.
311All this means that in the second iteration, the branch to $\mathtt{L}_{0}$ will
312be encoded as a long jump. If we assume that the branch to {\tt X} is encoded as
313a long jump as well, the size of the branch instruction will increase and
314$\mathtt{L}_{0}$ will be `propelled' into the same segment as its branch
315instruction, because every subsequent instruction will move one byte forward.
316Hence, in the third iteration, the branch to $\mathtt{L}_{0}$ can be encoded as
317an absolute jump. At first glance, there is nothing that prevents us from
318constructing a configuration where two branch instructions interact in such a
319way as to iterate indefinitely between long and absolute encodings.
321This situation mirrors the explanation by Szymanski~\cite{Szymanski1978} of why
322the branch displacement optimisation problem is NP-complete. In this explanation,
323a condition for NP-completeness is the fact that programs be allowed to contain
324{\em pathological} jumps. These are branch instructions that can normally not be
325encoded as a short(er) jump, but gain this property when some other branch
326instructions are encoded as a long(er) jump. This is exactly what happens in
327Figure~\ref{f:term_example}. By encoding the first branch instruction as a long
328jump, another branch instruction switches from long to absolute (which is
331In addition, our previous optimality argument no longer holds. Consider the
332program shown in Figure~\ref{f:opt_example}. Suppose that the distance between
333$\mathtt{L}_{0}$ and $\mathtt{L}_{1}$ is such that if {\tt jmp X} is encoded
334as a short jump, there is a segment border just after $\mathtt{L}_{1}$. Let
335us also assume that all three branches to $\mathtt{L}_{1}$ are in the same
336segment, but far enough away from $\mathtt{L}_{1}$ that they cannot be encoded
337as short jumps.
339Then, if {\tt jmp X} were to be encoded as a short jump, which is clearly
340possible, all of the branches to $\mathtt{L}_{1}$ would have to be encoded as
341long jumps. However, if {\tt jmp X} were to be encoded as a long jump, and
342therefore increase in size, $\mathtt{L}_{1}$ would be `propelled' across the
343segment border, so that the three branches to $\mathtt{L}_{1}$ could be encoded
344as absolute jumps. Depending on the relative sizes of long and absolute jumps,
345this solution might actually be smaller than the one reached by the smallest
346fixed point algorithm.
348\section{Our algorithm}
350\subsection{Design decisions}
352Given the NP-completeness of the problem, finding optimal solutions
353(using, for example, a constraint solver) can potentially be very costly.
355The SDCC compiler~\cite{SDCC2011}, which has a backend targeting the MCS-51
356instruction set, simply encodes every branch instruction as a long jump
357without taking the distance into account. While certainly correct (the long
358jump can reach any destination in memory) and a very fast solution to compute,
359it results in a less than optimal solution in terms of output size and
360execution time.
362On the other hand, the {\tt gcc} compiler suite, while compiling
363C on the x86 architecture, uses a greatest fix point algorithm. In other words,
364it starts with all branch instructions encoded as the largest jumps
365available, and then tries to reduce the size of branch instructions as much as
368Such an algorithm has the advantage that any intermediate result it returns
369is correct: the solution where every branch instruction is encoded as a large
370jump is always possible, and the algorithm only reduces those branch
371instructions whose destination address is in range for a shorter jump.
372The algorithm can thus be stopped after a determined number of steps without
373sacrificing correctness.
375The result, however, is not necessarily optimal. Even if the algorithm is run
376until it terminates naturally, the fixed point reached is the {\em greatest}
377fixed point, not the least fixed point. Furthermore, {\tt gcc} (at least for
378the x86 architecture) only uses short and long jumps. This makes the algorithm
379more efficient, as shown in the previous section, but also results in a less
380optimal solution.
382In the CerCo assembler, we opted at first for a least fixed point algorithm,
383taking absolute jumps into account.
385Here, we ran into a problem with proving termination, as explained in the
386previous section: if we only take short and long jumps into account, the jump
387encoding can only switch from short to long, but never in the other direction.
388When we add absolute jumps, however, it is theoretically possible for a branch
389instruction to switch from absolute to long and back, as previously explained.
390Proving termination then becomes difficult, because there is nothing that
391precludes a branch instruction from oscillating back and forth between absolute
392and long jumps indefinitely.
394To keep the algorithm in the same complexity class and more easily
395prove termination, we decided to explicitly enforce the `branch instructions
396must always grow longer' requirement: if a branch instruction is encoded as a
397long jump in one iteration, it will also be encoded as a long jump in all the
398following iterations. Therefore the encoding of any branch instruction
399can change at most two times: once from short to absolute (or long), and once
400from absolute to long.
402There is one complicating factor. Suppose that a branch instruction is encoded
403in step $n$ as an absolute jump, but in step $n+1$ it is determined that
404(because of changes elsewhere) it can now be encoded as a short jump. Due to
405the requirement that the branch instructions must always grow longer,
406the branch encoding will be encoded as an absolute jump in step
407$n+1$ as well.
409This is not necessarily correct. A branch instruction that can be
410encoded as a short jump cannot always also be encoded as an absolute jump, as a
411short jump can bridge segments, whereas an absolute jump cannot. Therefore,
412in this situation we have decided to encode the branch instruction as a long
413jump, which is always correct.
415The resulting algorithm, therefore, will not return the least fixed point, as
416it might have too many long jumps. However, it is still better than the
417algorithms from SDCC and {\tt gcc}, since even in the worst case, it will still
418return a smaller or equal solution.
420Experimenting with our algorithm on the test suite of C programs included with
421gcc 2.3.3 has shown that on average, about 25 percent of jumps are encoded as short or absolute jumps by the algorithm. As not all instructions are jumps, this does not make for a large reduction in size, but it can make for a reduction in execution time: if jumps
422are executed multiple times, for example in loops, the fact that short jumps take less cycles to execute than long jumps can have great effect.
424As for complexity, there are at most $2n$ iterations, where $n$ is the number of
425branch instructions. Practical tests within the CerCo project on small to
426medium pieces of code have shown that in almost all cases, a fixed point is
427reached in 3 passes. Only in one case did the algorithm need 4. This is not surprising: after all, the difference between short/absolute and
428long jumps is only one byte (three for conditional jumps). For a change from
429short/absolute to long to have an effect on other jumps is therefore relatively
430uncommon, which explains why a fixed point is reached so quickly.
432\subsection{The algorithm in detail}
434The branch displacement algorithm forms part of the translation from
435pseudocode to assembler. More specifically, it is used by the function that
436translates pseudo-addresses (natural numbers indicating the position of the
437instruction in the program) to actual addresses in memory. Note that in pseudocode, all instructions are of size 1.
439Our original intention was to have two different functions, one function
440$\mathtt{policy}: \mathbb{N} \rightarrow \{\mathtt{short\_jump},
441\mathtt{absolute\_jump}, \mathtt{long\_jump}\}$ to associate jumps to their
442intended encoding, and a function $\sigma: \mathbb{N} \rightarrow
443\mathtt{Word}$ to associate pseudo-addresses to machine addresses. $\sigma$
444would use $\mathtt{policy}$ to determine the size of jump instructions. This turned out to be suboptimal from the algorithmic point of view and
445impossible to prove correct.
447From the algorithmic point of view, in order to create the $\mathtt{policy}$
448function, we must necessarily have a translation from pseudo-addresses
449to machine addresses (i.e. a $\sigma$ function): in order to judge the distance
450between a jump and its destination, we must know their memory locations.
451Conversely, in order to create the $\sigma$ function, we need to have the
452$\mathtt{policy}$ function, otherwise we do not know the sizes of the jump
453instructions in the program.
455Much the same problem appears when we try to prove the algorithm correct: the
456correctness of $\mathtt{policy}$ depends on the correctness of $\sigma$, and
457the correctness of $\sigma$ depends on the correctness of $\mathtt{policy}$.
459We solved this problem by integrating the $\mathtt{policy}$ and $\sigma$
460algorithms. We now have a function
461$\sigma: \mathbb{N} \rightarrow \mathtt{Word} \times \mathtt{bool}$ which
462associates a pseudo-address to a machine address. The boolean denotes a forced
463long jump; as noted in the previous section, if during the fixed point
464computation an absolute jump changes to be potentially re-encoded as a short
465jump, the result is actually a long jump. It might therefore be the case that
466jumps are encoded as long jumps without this actually being necessary, and this
467information needs to be passed to the code generating function.
469The assembler function encodes the jumps by checking the distance between
470source and destination according to $\sigma$, so it could select an absolute
471jump in a situation where there should be a long jump. The boolean is there
472to prevent this from happening by indicating the locations where a long jump
473should be encoded, even if a shorter jump is possible. This has no effect on
474correctness, since a long jump is applicable in any situation.
480  \State $\langle added, pc, sigma \rangle \gets acc$
481  \If {$instr$ is a backward jump to $j$}
482    \State $length \gets \mathrm{jump\_size}(pc,sigma_1(labels(j)))$
483    \Comment compute jump distance
484  \ElsIf {$instr$ is a forward jump to $j$}
485    \State $length \gets \mathrm{jump\_size}(pc,old\_sigma_1(labels(j))+added)$
486  \EndIf
487  \State $old\_length \gets \mathrm{old\_sigma_1}(ppc)$
488  \State $new\_length \gets \mathrm{max}(old\_length, length)$
489  \Comment length must never decrease
490  \State $old\_size \gets \mathrm{old\_sigma_2}(ppc)$
491  \State $new\_size \gets \mathrm{instruction\_size}(instr,new\_length)$
492  \Comment compute size in bytes
493  \State $new\_added \gets added+(new\_size-old\_size)$
494  \Comment keep track of total added bytes
495  \State $new\_sigma \gets old\_sigma$
496  \State $new\_sigma_1(ppc+1) \gets pc+new\_size$
497  \State $new\_sigma_2(ppc) \gets new\_length$
498  \Comment update $\sigma$ \\
499  \Return $\langle new\_added, pc+new\_size, new\_sigma \rangle$
502\caption{The heart of the algorithm}
506The algorithm, shown in Figure~\ref{f:jump_expansion_step}, works by folding the
507function {\sc f} over the entire program, thus gradually constructing $sigma$.
508This constitutes one step in the fixed point calculation; successive steps
509repeat the fold until a fixed point is reached. We have abstracted away the case where an instruction is not a jump, since the size of these instructions is constant.
511Parameters of the function {\sc f} are:
513  \item a function $labels$ that associates a label to its pseudo-address;
514  \item $old\_sigma$, the $\sigma$ function returned by the previous
515    iteration of the fixed point calculation;
516  \item $instr$, the instruction currently under consideration;
517  \item $ppc$, the pseudo-address of $instr$;
518  \item $acc$, the fold accumulator, which contains $added$ (the number of
519  bytes added to the program size with respect to the previous iteration), $pc$
520  (the highest memory address reached so far), and of course $sigma$, the
521    $\sigma$ function under construction.
523The first two are parameters that remain the same through one iteration, the
524final three are standard parameters for a fold function (including $ppc$,
525which is simply the number of instructions of the program already processed).
527The $\sigma$ functions used by {\sc f} are not of the same type as the final
528$\sigma$ function: they are of type
529$\sigma: \mathbb{N} \rightarrow \mathbb{N} \times \{\mathtt{short\_jump},
530\mathtt{absolute\_jump},\mathtt{long\_jump}\}$; a function that associates a
531pseudo-address with a memory address and a jump length. We do this to
532ease the comparison of jump lengths between iterations. In the algorithm,
533we use the notation $sigma_1(x)$ to denote the memory address corresponding to
534$x$, and $sigma_2(x)$ for the jump length corresponding to $x$.
536Note that the $\sigma$ function used for label lookup varies depending on
537whether the label is behind our current position or ahead of it. For
538backward branches, where the label is behind our current position, we can use
539$sigma$ for lookup, since its memory address is already known. However, for
540forward branches, the memory address of the address of the label is not yet
541known, so we must use $old\_sigma$.
543We cannot use $old\_sigma$ without change: it might be the case that we have
544already increased the size of some branch instructions before, making the
545program longer and moving every instruction forward. We must compensate for this
546by adding the size increase of the program to the label's memory address
547according to $old\_sigma$, so that branch instruction spans do not get
550%Note also that we add the pc to $sigma$ at location $ppc+1$, whereas we add the
551%jump length at location $ppc$. We do this so that $sigma(ppc)$ will always
552%return a pair with the start address of the instruction at $ppc$ and the
553%length of its branch instruction (if any); the end address of the program can
554%be found at $sigma(n+1)$, where $n$ is the number of instructions in the
557\section{The proof}
559In this section, we present the correctness proof for the algorithm in more
560detail. The main correctness statement is shown, slightly simplified, in~Figure~\ref{statement}.
565\mathtt{sigma}&\omit\rlap{$\mathtt{\_policy\_specification} \equiv
566\lambda program.\lambda sigma.$} \notag\\
567  & \omit\rlap{$sigma\ 0 = 0\ \wedge$} \notag\\
568  & \mathbf{let}\ & & \omit\rlap{$instr\_list \equiv code\ program\ \mathbf{in}$} \notag\\
569  &&& \omit\rlap{$\forall ppc.ppc < |instr\_list| \rightarrow$} \notag\\
570  &&& \mathbf{let}\ && pc \equiv sigma\ ppc\ \mathbf{in} \notag\\
571  &&& \mathbf{let}\ && instruction \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc\ \mathbf{in} \notag\\
572  &&& \mathbf{let}\ && next\_pc \equiv sigma\ (ppc+1)\ \mathbf{in}\notag\\
573  &&&&& next\_pc = pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction\ \wedge\notag\\
574  &&&&& (pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction < 2^{16}\ \vee\notag\\
575  &&&&& (\forall ppc'.ppc' < |instr\_list| \rightarrow ppc < ppc' \rightarrow \notag\\
576  &&&&& \mathbf{let}\ instruction' \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc'\ \mathbf{in} \notag\\
577  &&&&&\ \mathtt{instruction\_size}\ sigma\ ppc'\ instruction' = 0)\ \wedge \notag\\
578  &&&&& pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction = 2^{16})
580\caption{Main correctness statement\label{statement}}
584Informally, this means that when fetching a pseudo-instruction at $ppc$, the
585translation by $\sigma$ of $ppc+1$ is the same as $\sigma(ppc)$ plus the size
586of the instruction at $ppc$.  That is, an instruction is placed consecutively
587after the previous one, and there are no overlaps. The rest of the statement deals with memory size: either the next instruction fits within memory ($next\_pc < 2^{16}$) or it ends exactly at the limit memory,
588in which case it must be the last translated instruction in the program (enforced by specfiying that the size of all subsequent instructions is 0: there may be comments or cost annotations that are not translated).
590Finally, we enforce that the program starts at address 0, i.e. $\sigma(0) = 0$. It may seem strange that we do not explicitly include a safety property stating that every jump instruction is of the right type with respect to its target (akin to the lemma from Figure~\ref{sigmasafe}), but this is not necessary. The distance is recalculated according to the instruction addresses from $\sigma$, which implicitly expresses safety.
592Since our computation is a least fixed point computation, we must prove
593termination in order to prove correctness: if the algorithm is halted after
594a number of steps without reaching a fixed point, the solution is not
595guaranteed to be correct. More specifically, branch instructions might be
596encoded which do not coincide with the span between their location and their
599Proof of termination rests on the fact that the encoding of branch
600instructions can only grow larger, which means that we must reach a fixed point
601after at most $2n$ iterations, with $n$ the number of branch instructions in
602the program. This worst case is reached if at every iteration, we change the
603encoding of exactly one branch instruction; since the encoding of any branch
604instruction can change first from short to absolute, and then to long, there
605can be at most $2n$ changes.
607%The proof has been carried out using the ``Russell'' style from~\cite{Sozeau2006}.
608%We have proven some invariants of the {\sc f} function from the previous
609%section; these invariants are then used to prove properties that hold for every
610%iteration of the fixed point computation; and finally, we can prove some
611%properties of the fixed point.
613\subsection{Fold invariants}
615In this section, we present the invariants that hold during the fold of {\sc f}
616over the program. These will be used later on to prove the properties of the
617iteration. During the fixed point computation, the $\sigma$ function is
618implemented as a trie for ease of access; computing $\sigma(x)$ is achieved by
619looking up the value of $x$ in the trie. Actually, during the fold, the value
620we pass along is a pair $\mathbb{N} \times \mathtt{ppc\_pc\_map}$. The first
621component is the number of bytes added to the program so far with respect to
622the previous iteration, and the second component, {\tt ppc\_pc\_map}, is the
623actual $\sigma$ trie (which we'll call $strie$ to avoid confusion).
627\mathtt{out} & \mathtt{\_of\_program\_none} \equiv \lambda prefix.\lambda strie. \notag\\
628& \forall i.i < 2^{16} \rightarrow (i > |prefix| \leftrightarrow
629 \mathtt{lookup\_opt}\ i\ (\mathtt{snd}\ strie) = \mathtt{None})
632The first invariant states that any pseudo-address not yet examined is not
633present in the lookup trie.
637\mathtt{not} & \mathtt{\_jump\_default} \equiv \lambda prefix.\lambda strie.\forall i.i < |prefix| \rightarrow\notag\\
638& \neg\mathtt{is\_jump}\ (\mathtt{nth}\ i\ prefix) \rightarrow \mathtt{lookup}\ i\ (\mathtt{snd}\ strie) = \mathtt{short\_jump}
641This invariant states that when we try to look up the jump length of a
642pseudo-address where there is no branch instruction, we will get the default
643value, a short jump.
647\mathtt{jump} & \mathtt{\_increase} \equiv \lambda pc.\lambda op.\lambda p.\forall i.i < |prefix| \rightarrow \notag\\ 
648& \mathbf{let}\  oj \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ op)\ \mathbf{in} \notag\\
649& \mathbf{let}\ j \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ p)\ \mathbf{in}\ \mathtt{jmpleq}\ oj\ j
652This invariant states that between iterations (with $op$ being the previous
653iteration, and $p$ the current one), jump lengths either remain equal or
654increase. It is needed for proving termination.
659\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact\_unsafe} \equiv \lambda prefix.\lambda strie.\forall n.n < |prefix| \rightarrow$}\notag\\
660& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ strie)\ \mathbf{with}$}\notag\\
661&&& \omit\rlap{$\mathtt{None} \Rightarrow \mathrm{False}$} \notag\\
662&&& \omit\rlap{$\mathtt{Some}\ \langle pc, j \rangle \Rightarrow$} \notag\\
663&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ strie)\ \mathbf{with}\notag\\
664&&&&& \mathtt{None} \Rightarrow \mathrm{False} \notag\\
665&&&&& \mathtt{Some}\ \langle pc_1, j_1 \rangle \Rightarrow
666    pc_1 = pc + \notag\\
667&&&&& \ \ \mathtt{instruction\_size\_jmplen}\ j\ (\mathtt{nth}\ n\ prefix)
669\caption{Temporary safety property}
673We now proceed with the safety lemmas. The lemma in
674Figure~\ref{sigmacompactunsafe} is a temporary formulation of the main
675property {\tt sigma\_policy\_specification}. Its main difference from the
676final version is that it uses {\tt instruction\_size\_jmplen} to compute the
677instruction size. This function uses $j$ to compute the span of branch
678instructions  (i.e. it uses the $\sigma$ under construction), instead
679of looking at the distance between source and destination. This is because
680$\sigma$ is still under construction; we will prove below that after the
681final iteration, {\tt sigma\_compact\_unsafe} is equivalent to the main
682property in Figure~\ref{sigmasafe} which holds at the end of the computation.
687\mathtt{sigma} & \omit\rlap{$\mathtt{\_safe} \equiv \lambda prefix.\lambda labels.\lambda old\_strie.\lambda strie.\forall i.i < |prefix| \rightarrow$} \notag\\
688& \omit\rlap{$\forall dest\_label.\mathtt{is\_jump\_to\ (\mathtt{nth}\ i\ prefix})\ dest\_label \rightarrow$} \notag\\
689& \mathbf{let} && \omit\rlap{$\ paddr \equiv \mathtt{lookup}\ labels\ dest\_label\ \mathbf{in}$} \notag\\
690& \mathbf{let} && \omit\rlap{$\ \langle j, src, dest \rangle \equiv \mathbf{if} \ paddr\ \leq\ i\ \mathbf{then}$}\notag\\
691&&&&& \mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
692&&&&& \mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
693&&&&& \mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
694&&&&& \langle j, pc\_plus\_jl, addr \rangle\notag\\
695&&&\mathbf{else} \notag\\
696&&&&&\mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
697&&&&&\mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
698&&&&&\mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
699&&&&&\langle j, pc\_plus\_jl, addr \rangle \mathbf{in}\ \notag\\
700&&&\mathbf{match} && \ j\ \mathbf{with} \notag\\
701&&&&&\mathrm{short\_jump} \Rightarrow \mathtt{short\_jump\_valid}\ src\ dest\notag\\
702&&&&&\mathrm{absolute\_jump} \Rightarrow \mathtt{absolute\_jump\_valid}\ src\ dest\notag\\
703&&&&&\mathrm{long\_jump} \Rightarrow \mathrm{True}
705\caption{Safety property}
709We compute the distance using the memory address of the instruction
710plus its size. This follows the behaviour of the MCS-51 microprocessor, which
711increases the program counter directly after fetching, and only then executes
712the branch instruction (by changing the program counter again).
714There are also some simple, properties to make sure that our policy
715remains consistent, and to keep track of whether the fixed point has been
716reached. We do not include them here in detail. Two of these properties give the values of $\sigma$ for the start and end of the program; $\sigma(0) = 0$ and $\sigma(n)$, where $n$ is the number of instructions up until now, is equal to the maximum memory address so far. There are also two properties that deal with what happens when the previous
717iteration does not change with respect to the current one. $added$ is a
718variable that keeps track of the number of bytes we have added to the program
719size by changing the encoding of branch instructions. If $added$ is 0, the program
720has not changed and vice versa.
724%& \mathtt{lookup}\ 0\ (\mathtt{snd}\ strie) = 0 \notag\\
725%& \mathtt{lookup}\ |prefix|\ (\mathtt{snd}\ strie) = \mathtt{fst}\ strie
731%& added = 0\ \rightarrow\ \mathtt{policy\_pc\_equal}\ prefix\ old\_strie\ strie \notag\\
732%& \mathtt{policy\_jump\_equal}\ prefix\ old\_strie\ strie\ \rightarrow\ added = 0
735We need to use two different formulations, because the fact that $added$ is 0
736does not guarantee that no branch instructions have changed.  For instance,
737it is possible that we have replaced a short jump with an absolute jump, which
738does not change the size of the branch instruction. Therefore {\tt policy\_pc\_equal} states that $old\_sigma_1(x) = sigma_1(x)$, whereas {\tt policy\_jump\_equal} states that $old\_sigma_2(x) = sigma_2(x)$. This formulation is sufficient to prove termination and compactness. 
740Proving these invariants is simple, usually by induction on the prefix length.
742\subsection{Iteration invariants}
744These are invariants that hold after the completion of an iteration. The main
745difference between these invariants and the fold invariants is that after the
746completion of the fold, we check whether the program size does not supersede
74764 Kb, the maximum memory size the MCS-51 can address. The type of an iteration therefore becomes an option type: {\tt None} in case
748the program becomes larger than 64 Kb, or $\mathtt{Some}\ \sigma$
749otherwise. We also no longer pass along the number of bytes added to the
750program size, but a boolean that indicates whether we have changed something
751during the iteration or not.
753If the iteration returns {\tt None}, which means that it has become too large for memory, there is an invariant that states that the previous iteration cannot
754have every branch instruction encoded as a long jump. This is needed later in the proof of termination. If the iteration returns $\mathtt{Some}\ \sigma$, the fold invariants are retained without change.
756Instead of using {\tt sigma\_compact\_unsafe}, we can now use the proper
761\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact} \equiv \lambda program.\lambda sigma.$} \notag\\
762& \omit\rlap{$\forall n.n < |program|\ \rightarrow$} \notag\\
763& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ sigma)\ \mathbf{with}$}\notag\\
764&&& \omit\rlap{$\mathrm{None}\ \Rightarrow\ \mathrm{False}$}\notag\\
765&&& \omit\rlap{$\mathrm{Some}\ \langle pc, j \rangle \Rightarrow$}\notag\\
766&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ sigma)\ \mathbf{with}\notag\\
767&&&&& \mathrm{None}\ \Rightarrow\ \mathrm{False}\notag\\
768&&&&& \mathrm{Some} \langle pc1, j1 \rangle \Rightarrow\notag\\
769&&&&& \ \ pc1 = pc + \mathtt{instruction\_size}\ n\ (\mathtt{nth}\ n\ program)
772This is almost the same invariant as ${\tt sigma\_compact\_unsafe}$, but differs in that it
773computes the sizes of branch instructions by looking at the distance between
774position and destination using $\sigma$. In actual use, the invariant is qualified: $\sigma$ is compact if there have
775been no changes (i.e. the boolean passed along is {\tt true}). This is to
776reflect the fact that we are doing a least fixed point computation: the result
777is only correct when we have reached the fixed point.
779There is another, trivial, invariant in case the iteration returns
780$\mathtt{Some}\ \sigma$: it must hold that $\mathtt{fst}\ sigma < 2^{16}$.
781We need this invariant to make sure that addresses do not overflow.
783The proof of {\tt nec\_plus\_ultra} goes as follows: if we return {\tt None},
784then the program size must be greater than 64 Kb. However, since the
785previous iteration did not return {\tt None} (because otherwise we would
786terminate immediately), the program size in the previous iteration must have
787been smaller than 64 Kb.
789Suppose that all the branch instructions in the previous iteration are
790encoded as long jumps. This means that all branch instructions in this
791iteration are long jumps as well, and therefore that both iterations are equal
792in the encoding of their branch instructions. Per the invariant, this means that
793$added = 0$, and therefore that all addresses in both iterations are equal.
794But if all addresses are equal, the program sizes must be equal too, which
795means that the program size in the current iteration must be smaller than
79664 Kb. This contradicts the earlier hypothesis, hence not all branch
797instructions in the previous iteration are encoded as long jumps.
799The proof of {\tt sigma\_compact} follows from {\tt sigma\_compact\_unsafe} and
800the fact that we have reached a fixed point, i.e. the previous iteration and
801the current iteration are the same. This means that the results of
802{\tt instruction\_size\_jmplen} and {\tt instruction\_size} are the same.
804\subsection{Final properties}
806These are the invariants that hold after $2n$ iterations, where $n$ is the
807program size (we use the program size for convenience; we could also use the
808number of branch instructions, but this is more complex). Here, we only
809need {\tt out\_of\_program\_none}, {\tt sigma\_compact} and the fact that
810$\sigma(0) = 0$.
812Termination can now be proved using the fact that there is a $k \leq 2n$, with
813$n$ the length of the program, such that iteration $k$ is equal to iteration
814$k+1$. There are two possibilities: either there is a $k < 2n$ such that this
815property holds, or every iteration up to $2n$ is different. In the latter case,
816since the only changes between the iterations can be from shorter jumps to
817longer jumps, in iteration $2n$ every branch instruction must be encoded as
818a long jump. In this case, iteration $2n$ is equal to iteration $2n+1$ and the
819fixed point is reached.
823In the previous sections we have discussed the branch displacement optimisation
824problem, presented an optimised solution, and discussed the proof of
825termination and correctness for this algorithm, as formalised in Matita.
827The algorithm we have presented is fast and correct, but not optimal; a true
828optimal solution would need techniques like constraint solvers. While outside
829the scope of the present research, it would be interesting to see if enough
830heuristics could be found to make such a solution practical for implementing
831in an existing compiler; this would be especially useful for embedded systems,
832where it is important to have as small a solution as possible.
834In itself the algorithm is already useful, as it results in a smaller solution
835than the simple `every branch instruction is long' used up until now---and with
836only 64 Kb of memory, every byte counts. It also results in a smaller solution
837than the greatest fixed point algorithm that {\tt gcc} uses. It does this
838without sacrificing speed or correctness.
840The certification of an assembler that relies on the branch displacement
841algorithm described in this paper was presented in~\cite{lastyear}.
842The assembler computes the $\sigma$ map as described in this paper and
843then works in two passes. In the first pass it builds a map
844from instruction labels to addresses in the assembly code. In the
845second pass it iterates over the code, translating every pseudo jump
846at address $src$ to a label l associated to the assembly instruction at
847address $dst$ to a jump of the size dictated by $(\sigma\ src)$ to
848$(\sigma\ dst)$. In case of conditional jumps, the translated jump may be
849implemented with a series of instructions.
851The proof of correctness abstracts over the algorithm used and only relies on
852{\tt sigma\_policy\_specification} (page~\ref{sigmapolspec}). It is a variation
853of a standard 1-to-many forward simulation proof~\cite{Leroy2009}. The
854relation R between states just maps every code address $ppc$ stored in
855registers or memory to $(\sigma\ ppc)$. To identify the code addresses,
856an additional data structure is always kept together with the source
857state and is updated by the semantics. The semantics is preserved
858only for those programs whose source code operations
859$(f\ ppc_1\ \ldots\ ppc_n)$ applied to code addresses $ppc_1 \ldots ppc_n$ are
860such that $(f\ (\sigma\ ppc_1)\ ...\ (\sigma\ ppc_n) = f\ ppc_1\ ppc_n))$. For
861example, an injective $\sigma$ preserves a binary equality test f for code
862addresses, but not pointer subtraction.
864The main lemma (fetching simulation), which relies on\\
865{\tt sigma\_policy\_specification} and is established by structural induction
866over the source code, says that fetching an assembly instruction at
867position ppc is equal to fetching the translation of the instruction at
868position $(\sigma\ ppc)$, and that the new incremented program counter is at
869the beginning of the next instruction (compactness). The only exception is
870when the instruction fetched is placed at the end of code memory and is
871followed only by dead code. Execution simulation is trivial because of the
872restriction over well behaved programs w.r.t. sigma. The condition
873$\sigma\ 0 = 0$ is necessary because the hardware model prescribes that the
874first instruction to be executed will be at address 0. For the details
877Instead of verifying the algorithm directly, another solution to the problem
878would be to run an optimisation algorithm, and then verify the safety of the
879result using a verified validator. Such a validator would be easier to verify
880than the algorithm itself and it would also be efficient, requiring only a
881linear pass over the source code to test the specification. However, it is
882surely also interesting to formally prove that the assembler never rejects
883programs that should be accepted, i.e. that the algorithm itself is correct.
884This is the topic of the current paper.
886\subsection{Related work}
888As far as we are aware, this is the first formal discussion of the branch
889displacement optimisation algorithm.
891The CompCert project is another verified compiler project.
892Their backend~\cite{Leroy2009} generates assembly code for (amongst others) subsets of the
893PowerPC and x86 (32-bit) architectures. At the assembly code stage, there is
894no distinction between the span-dependent jump instructions, so a branch
895displacement optimisation algorithm is not needed.
897%An offshoot of the CompCert project is the CompCertTSO project, which adds
898%thread concurrency and synchronisation to the CompCert compiler~\cite{TSO2011}.
899%This compiler also generates assembly code and therefore does not include a
900%branch displacement algorithm.
902%Finally, there is also the Piton stack~\cite{Moore1996}, which not only includes the
903%formal verification of a compiler, but also of the machine architecture
904%targeted by that compiler, a microprocessor called the FM9001.
905%However, this architecture does not have different
906%jump sizes (branching is simulated by assigning values to the program counter),
907%so the branch displacement problem is irrelevant.
Note: See TracBrowser for help on using the repository browser.